schema evolution support #745

Merged: 1 commit from the schema_evolution_support branch into apache:master on Feb 13, 2020

Conversation

@ravichinoy (Contributor) commented Jan 21, 2020

  1. Skip the column ordering check
  2. Use the input schema to build accessors

Please refer to issue #741 for details.

@rdblue (Contributor) commented Jan 22, 2020

From #741, it looks like the problem this is trying to address is that you can't write to a table with a different column order than the order of the table schema.

The reason for this restriction is that Spark should be responsible for reconciling table columns with the data from a query. Spark has two different modes for doing this: by position for SQL, and by name for DataFrame writes. I think that delegating this to Spark is the right long-term solution.
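
As an illustrative sketch only (table and view names here are hypothetical, and, as the next paragraph notes, Spark 2.4's v2 write path does not yet apply the by-name reconciliation), the two modes look roughly like this from the user's side:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ResolutionModesSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("resolution-modes").getOrCreate();

    // SQL inserts are resolved by position: the SELECT list must line up with the
    // table's column order, whatever the expressions are named.
    spark.sql("INSERT INTO db.events SELECT id, ts, payload FROM staging");

    // DataFrame writes are meant to be resolved by name: the DataFrame's column order
    // does not need to match the table schema as long as the column names match.
    Dataset<Row> df = spark.table("staging").select("payload", "id", "ts");
    df.write().format("iceberg").mode("append").save("db.events");

    spark.stop();
  }
}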

In the short term, Spark 2.4 has no resolution step for v2 writes. That's why Iceberg has the current checks that it does, so that you don't corrupt a table by writing the wrong thing. I'd rather not extend the current checks if we don't need to, given that this will be handled by Spark in the next version. Is that reasonable?

@davrmac commented Jan 23, 2020

Hi @rdblue, I'm working with @ravichinoy on this issue and had a couple of follow-up questions:

  1. For the compatibility check change, is your concern just about not updating a code path that's about to be deprecated, or do you also have correctness concerns about dropping the ordering check? From my understanding of the Spark-to-Iceberg write path, Spark has already bound field names to all of the columns being written by the time it calls the IcebergSource to create a writer, and Iceberg uses those field names (not the field order) to map back to the corresponding field IDs in the Iceberg schema, which it then uses to perform the rest of its compatibility checks. Are there other parts of the Iceberg write path that still depend on field order and would cause corruption if the ordering check weren't enforced?

  2. Do you have any concerns with the second half of @ravichinoy's change (i.e. switching from using the Iceberg table schema to the Spark schema when building the PartitionKey accessor)? It's less relevant if we're still enforcing that the Spark schema match the field order of the Iceberg table schema, but I believe there are still scenarios where, if the Spark schema is missing optional fields that are defined in the Iceberg table schema, the write can pass the compatibility check but build a PartitionKey accessor that fails to pull the correct fields out of each InternalRow (a sketch of that failure mode follows; we can also write a test case to prove whether or not this is actually an issue).
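
A minimal, self-contained sketch of the failure mode in question 2; none of the types below are Iceberg classes, they only stand in for accessors that read a partition source field by its ordinal position in a row:

import java.util.Arrays;
import java.util.List;

public class AccessorMismatchSketch {

  // Stand-in for a schema-derived accessor that reads the value at a fixed position.
  static final class PositionAccessor {
    final int position;

    PositionAccessor(int position) {
      this.position = position;
    }

    Object get(Object[] row) {
      return row[position];
    }
  }

  public static void main(String[] args) {
    // Table schema order: id, data (optional), category; "category" is the partition source field.
    List<String> tableSchema = Arrays.asList("id", "data", "category");

    // The incoming Spark schema omits the optional "data" column but keeps the relative
    // order of the remaining columns, so an ordering check would still pass.
    List<String> writeSchema = Arrays.asList("id", "category");
    Object[] incomingRow = {42L, "books"};  // laid out in writeSchema order

    // An accessor built from the table schema targets position 2, which the incoming row lacks.
    PositionAccessor fromTableSchema = new PositionAccessor(tableSchema.indexOf("category"));
    try {
      System.out.println("table-schema accessor reads: " + fromTableSchema.get(incomingRow));
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("table-schema accessor failed: position " + fromTableSchema.position
          + " is outside the incoming row");
    }

    // An accessor built from the incoming (write) schema targets position 1 and reads the intended value.
    PositionAccessor fromWriteSchema = new PositionAccessor(writeSchema.indexOf("category"));
    System.out.println("write-schema accessor reads: " + fromWriteSchema.get(incomingRow));
  }
}

Building the accessor from the incoming schema, as the second half of the change does, avoids this mismatch.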

@rdblue (Contributor) commented Jan 24, 2020

> Do you also have correctness concerns about dropping the ordering check?

I don't have correctness concerns. Handling field reordering is a requirement, and Spark recently added DDL statements to reorder fields.

Mainly, I wonder if it is a good idea to write data files with a column order that depends on the incoming query instead of the table's current column order. Reordering columns can change performance in Avro and it may be surprising when people look at a file in the table and see columns in an unexpected order. These aren't correctness blockers, but if this is going to be moot in the next Spark release then my thinking was that we don't need to choose what the behavior should be.

> Do you have any concerns with the second half of @ravichinoy's change (i.e. switching from using the Iceberg table schema to the Spark schema when building the PartitionKey accessor)?

No. And now that you've mentioned your concerns, I think using the incoming schema is probably a good idea.

@@ -78,7 +78,7 @@
   private CheckCompatibility(Schema schema, boolean checkOrdering, boolean checkNullability) {
     this.schema = schema;
-    this.checkOrdering = checkOrdering;
+    this.checkOrdering = false;

I think this should still pass checkOrdering correctly. To fix the problem this is trying to address, I think this should add a write option that gets passed into this.
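
A rough sketch of the shape being suggested, with an assumed option name and helper class (not the actual Iceberg code): read a boolean write option and thread it into the constructor instead of hard-coding false.

import java.util.Map;

final class OrderingCheckOption {
  // Illustrative option name; the exact spelling is up to the implementation.
  private static final String CHECK_ORDERING = "check-ordering";

  private OrderingCheckOption() {
  }

  // Defaults to true so the existing strict behavior is kept unless the user opts out.
  static boolean checkOrdering(Map<String, String> writeOptions) {
    return Boolean.parseBoolean(writeOptions.getOrDefault(CHECK_ORDERING, "true"));
  }
}

With a value like this available at write time, the constructor can keep this.checkOrdering = checkOrdering; and each caller decides per write whether ordering is enforced.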

@@ -491,10 +491,10 @@ public void write(InternalRow row) throws IOException {
                     AppenderFactory<InternalRow> appenderFactory,
                     WriterFactory.OutputFileFactory fileFactory,
                     FileIO fileIo,
-                    long targetFileSize) {
+                    long targetFileSize,
+                    Schema schema) {

Can we call this writeSchema or some name that indicates it is the schema of the incoming data, not necessarily the table schema?

@@ -132,7 +132,7 @@ public void testNullPartitionValue() throws Exception {

    try {
      // TODO: incoming columns must be ordered according to the table's schema

We should be able to remove this TODO.

@rdblue (Contributor) commented Jan 24, 2020

@ravichinoy & @davrmac, I think we can get this in if we add a write config option that controls whether to do the order check, since we already have a boolean for it. We just need to pass it in from writes and then it's up to the user whether to require the same column order as the table. Does that work for you?
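
Illustratively, and assuming the option ends up spelled "check-ordering" with a default of true (table and view names here are hypothetical), opting out of the order check for a single write might look like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DisableOrderingCheck {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("disable-ordering-check").getOrCreate();

    Dataset<Row> df = spark.table("source_view");  // hypothetical source with a different column order

    df.write()
        .format("iceberg")
        .option("check-ordering", "false")  // opt out of the column order check for this write
        .mode("append")
        .save("db.events");                 // hypothetical target table

    spark.stop();
  }
}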

@ravichinoy force-pushed the schema_evolution_support branch 4 times, most recently from 5ad1135 to 9a25c9b on January 29, 2020 06:52
    List<String> errors;
    if (checkNullability) {
-     errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema);
+     errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema, checkOrdering);
    } else {
      errors = CheckCompatibility.typeCompatibilityErrors(tableSchema, dsSchema);

@ravichinoy checkOrdering needs to be passed here as well. You should be able to turn off both the nullability and ordering checks.
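
Sketched against the snippet above (the exact CheckCompatibility overloads are whatever the final code exposes), the request is for both branches to receive checkOrdering:

    List<String> errors;
    if (checkNullability) {
      errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema, checkOrdering);
    } else {
      errors = CheckCompatibility.typeCompatibilityErrors(tableSchema, dsSchema, checkOrdering);
    }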

   * @param checkOrdering If false, allow input schema to have different ordering than table schema
   * @return a list of error details, or an empty list if there are no compatibility problems
   */
  public static List<String> writeCompatibilityErrors(Schema readSchema, Schema writeSchema, Boolean checkOrdering) {

Why is this a boxed boolean?

@@ -189,12 +189,13 @@ private static void mergeIcebergHadoopConfs(
        .forEach(key -> baseConf.set(key.replaceFirst("hadoop.", ""), options.get(key)));
  }

- private void validateWriteSchema(Schema tableSchema, Schema dsSchema, Boolean checkNullability) {
+ private void validateWriteSchema(
+     Schema tableSchema, Schema dsSchema, Boolean checkNullability, Boolean checkOrdering) {

It looks like this is getting too complicated. We now have two boolean options with similar names: one is used to select the compatibility checking method and the other is passed as an arg, which isn't readable. I think it is a good idea to convert compatibility checking to a builder-like pattern:

CheckCompatibility
    .writeSchema(dsSchema)
    .readSchema(tableSchema)
    .checkOrdering(true)
    .checkNullability(false)
    .throwOnValidationError();

@rdblue (Contributor) commented Feb 12, 2020

@ravichinoy, thanks for working on this. It looks good.

Can you either fix the methods that use a boxed Boolean or implement the builder-like API I suggested? I can add the builder in a follow-up if you want to get this in more quickly.

Commit message:
  make check ordering configurable
  use input schema to build accessors
@rdblue merged commit b35899f into apache:master on Feb 13, 2020
@rdblue (Contributor) commented Feb 13, 2020

Thanks for updating this. I merged it.

@ravichinoy (Contributor, Author) commented

Thanks @rdblue, we will try to implement the builder pattern as part of a follow-up PR.

jun-ma-0 pushed a commit to jun-ma-0/incubator-iceberg that referenced this pull request May 11, 2020