schema evolution support #745
Conversation
From #741, it looks like the problem this is trying to address is that you can't write to a table with a different column order than the order of the table schema. The reason for this restriction is that Spark should be responsible for reconciling table columns with the data from a query. Spark has two different modes for doing this: by position for SQL, and by name for DataFrame writes. I think that delegating this to Spark is the right long-term solution.

In the short term, Spark 2.4 has no resolution step for v2 writes. That's why Iceberg has the current checks that it does, so that you don't corrupt a table by writing the wrong thing. I'd rather not extend the current checks if we don't need to, given that this will be handled by Spark in the next version. Is that reasonable?
Hi @rdblue, I'm working with @ravichinoy on this issue and had a couple of follow-up questions:
I don't have correctness concerns. Handling field reordering is a requirement, and Spark recently added DDL statements to reorder fields. Mainly, I wonder if it is a good idea to write data files with a column order that depends on the incoming query instead of the table's current column order. Reordering columns can change performance in Avro and it may be surprising when people look at a file in the table and see columns in an unexpected order. These aren't correctness blockers, but if this is going to be moot in the next Spark release then my thinking was that we don't need to choose what the behavior should be.
No. And now that you mention your concerns, I think it's probably a good idea to use the incoming schema.
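The by-name reconciliation discussed above, where the writer reorders incoming columns to the table's current column order, can be sketched as a simple reorder step. The class and method names below are illustrative only, not Iceberg's actual writer code:

```java
import java.util.Arrays;
import java.util.List;

public class ReorderColumns {
  // Hypothetical helper: reorder the incoming row's values into the table's
  // column order, mimicking the by-name resolution Spark is expected to
  // perform for v2 writes in a later release.
  static Object[] reorder(List<String> tableColumns, List<String> incomingColumns, Object[] row) {
    Object[] out = new Object[tableColumns.size()];
    for (int i = 0; i < tableColumns.size(); i++) {
      // find each table column in the incoming schema and copy its value
      out[i] = row[incomingColumns.indexOf(tableColumns.get(i))];
    }
    return out;
  }

  public static void main(String[] args) {
    Object[] reordered = reorder(
        List.of("id", "data"),        // table schema order
        List.of("data", "id"),        // incoming query order
        new Object[] {"a", 1});
    System.out.println(Arrays.toString(reordered)); // [1, a]
  }
}
```

With this approach, data files always land in the table's column order regardless of how the query orders its projection.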
@@ -78,7 +78,7 @@
  private CheckCompatibility(Schema schema, boolean checkOrdering, boolean checkNullability) {
    this.schema = schema;
-   this.checkOrdering = checkOrdering;
+   this.checkOrdering = false;
I think this should still pass checkOrdering correctly. To fix the problem this is trying to address, I think this should add a write option that gets passed in here.
@@ -491,10 +491,10 @@ public void write(InternalRow row) throws IOException {
      AppenderFactory<InternalRow> appenderFactory,
      WriterFactory.OutputFileFactory fileFactory,
      FileIO fileIo,
-     long targetFileSize) {
+     long targetFileSize,
+     Schema schema) {
Can we call this writeSchema, or some name that indicates it is the schema of the incoming data, not necessarily the table schema?
@@ -132,7 +132,7 @@ public void testNullPartitionValue() throws Exception {
    try {
      // TODO: incoming columns must be ordered according to the table's schema
We should be able to remove this TODO.
@ravichinoy & @davrmac, I think we can get this in if we add a write config option that controls whether to do the order check, since we already have a boolean for it. We just need to pass it in from writes and then it's up to the user whether to require the same column order as the table. Does that work for you?
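The suggested write config option could be read from the write's options map roughly as follows. The option key "check-ordering", the class name, and the helper method are assumptions for illustration, not necessarily what the PR ends up using:

```java
import java.util.Map;

public class WriteOptions {
  // Hypothetical option key; the actual name chosen in the PR may differ.
  private static final String CHECK_ORDERING = "check-ordering";

  // Read the ordering-check flag from write options, defaulting to true so the
  // existing strict behavior is preserved when the option is not set.
  static boolean checkOrdering(Map<String, String> options) {
    String value = options.get(CHECK_ORDERING);
    return value == null || Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    System.out.println(checkOrdering(Map.of()));                          // true
    System.out.println(checkOrdering(Map.of("check-ordering", "false"))); // false
  }
}
```

Defaulting to true keeps the current strict check unless a user explicitly opts out per write.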
Force-pushed from 5ad1135 to 9a25c9b.
List<String> errors;
if (checkNullability) {
-  errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema);
+  errors = CheckCompatibility.writeCompatibilityErrors(tableSchema, dsSchema, checkOrdering);
} else {
  errors = CheckCompatibility.typeCompatibilityErrors(tableSchema, dsSchema);
@ravichinoy checkOrdering needs to be passed here as well. You should be able to turn off both the nullability and ordering checks.
Force-pushed from ec2aa36 to 28c5297.
 * @param checkOrdering If false, allow input schema to have different ordering than table schema
 * @return a list of error details, or an empty list if there are no compatibility problems
 */
public static List<String> writeCompatibilityErrors(Schema readSchema, Schema writeSchema, Boolean checkOrdering) {
Why is this a boxed boolean?
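For context on that question: a boxed Boolean parameter can be null, and auto-unboxing a null Boolean in a condition throws a NullPointerException at runtime, whereas a primitive boolean rules that out at compile time. A minimal illustration with hypothetical names:

```java
public class BoxedBooleanPitfall {
  // A boxed Boolean can be null; unboxing it in the condition below throws
  // NullPointerException. With a primitive boolean parameter, passing null
  // would not even compile.
  static String check(Boolean checkOrdering) {
    if (checkOrdering) { // NPE here if checkOrdering is null
      return "ordered";
    }
    return "unordered";
  }

  public static void main(String[] args) {
    System.out.println(check(true)); // ordered
    try {
      check(null);
    } catch (NullPointerException e) {
      System.out.println("NPE on unboxing"); // NPE on unboxing
    }
  }
}
```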
@@ -189,12 +189,13 @@ private static void mergeIcebergHadoopConfs(
        .forEach(key -> baseConf.set(key.replaceFirst("hadoop.", ""), options.get(key)));
  }

- private void validateWriteSchema(Schema tableSchema, Schema dsSchema, Boolean checkNullability) {
+ private void validateWriteSchema(
+     Schema tableSchema, Schema dsSchema, Boolean checkNullability, Boolean checkOrdering) {
It looks like this is getting too complicated. We now have two boolean options: one selects the compatibility checking method (and the two methods have similar names), and the other is passed as an argument, which isn't readable at call sites. I think it is a good idea to convert compatibility checking to a builder-like pattern:
CheckCompatibility
.writeSchema(dsSchema)
.readSchema(tableSchema)
.checkOrdering(true)
.checkNullability(false)
.throwOnValidationError();
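A minimal sketch of how such a builder could look. This is hypothetical: Iceberg's real CheckCompatibility is a schema visitor over Schema objects, and the nullability check is omitted for brevity; schemas are reduced to ordered column-name lists so only the fluent API shape is shown:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch of the suggested fluent API, not Iceberg's implementation.
class CheckCompatibility {
  private final List<String> writeSchema;
  private List<String> readSchema;
  private boolean checkOrdering = true;

  private CheckCompatibility(List<String> writeSchema) {
    this.writeSchema = writeSchema;
  }

  static CheckCompatibility writeSchema(List<String> schema) {
    return new CheckCompatibility(schema);
  }

  CheckCompatibility readSchema(List<String> schema) {
    this.readSchema = schema;
    return this;
  }

  CheckCompatibility checkOrdering(boolean check) {
    this.checkOrdering = check;
    return this;
  }

  List<String> validationErrors() {
    List<String> errors = new ArrayList<>();
    if (checkOrdering) {
      // strict mode: incoming columns must match the table's order exactly
      if (!writeSchema.equals(readSchema)) {
        errors.add("column order differs from table schema");
      }
    } else if (!new HashSet<>(writeSchema).containsAll(readSchema)) {
      // relaxed mode: every table column must still be present by name
      errors.add("incoming data is missing table columns");
    }
    return errors;
  }

  void throwOnValidationError() {
    List<String> errors = validationErrors();
    if (!errors.isEmpty()) {
      throw new IllegalArgumentException(String.join("; ", errors));
    }
  }
}
```

The fluent chain makes each flag self-describing at the call site, so readers no longer have to remember which positional boolean means what.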
@ravichinoy, thanks for working on this. It looks good. Can you either fix the methods that use a boxed Boolean or implement the builder-like API I suggested? I can add the builder in a follow-up if you want to get this in more quickly.
make check ordering configurable; use input schema to build accessors
Force-pushed from 28c5297 to 3978431.
Thanks for updating this. I merged it.
Thanks @rdblue, we will try to implement the builder pattern as part of a follow-up PR.
Please refer to Issue #741 for details.