Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make expression binding support a case sensitivity flag #82

Merged

Conversation

xabriel
Copy link
Contributor

@xabriel xabriel commented Jan 17, 2019

Iceberg's current implementation has column case sensitivity, which hinders usability, as most sql users expect case insensitivity by default. While a query like the following will succeed in other Spark Readers, it will fail on Iceberg:

SELECT COUNT(*)
FROM iceTable
WHERE year = 2017
  AND MONTH = 11 -- Notice how MONTH has different casing than other predicates
  AND day = 01

This will fail with a stack trace similar to:

com.google.common.util.concurrent.UncheckedExecutionException: com.netflix.iceberg.exceptions.ValidationException: Cannot find field 'MONTH' in struct: struct<...>
...

In this PR, we solve this by making iceberg-api case-insensitive when binding expressions.

Some further notes:

  • We could also solve this in iceberg-spark, however, that implies we would have to solve it in any other engine that supports iceberg ( presto, pig, etc ).

  • Enabling case insensitivity implies that a table can not have columns were LOWER(a) == LOWER(b). I have a TODO in the PR as maybe this change makes more sense under a feature flag.

@rdblue
Copy link
Contributor

rdblue commented Jan 18, 2019

Thanks for catching this, @xabriel. Looks like a reasonable set of changes to me.

The only problem I have is that this should be configurable for systems that are case sensitive. If Iceberg should not impose case sensitivity on processing engines then it shouldn't impose case insensitivity either.

I think this needs that configuration before the binder behavior change can be committed. The binder should probably accept a case sensitivity boolean argument, and that should be passed through to make this work in Spark.

If you want, we could separate this into the changes here other than updating the behavior of Binder, then add the rest in a follow-up. Up to you how you want to move forward.

@rdblue
Copy link
Contributor

rdblue commented Jan 18, 2019

I should also note that we could also update Spark to pass the correct case back to v2 sources. That's probably a good idea either way.

@xabriel
Copy link
Contributor Author

xabriel commented Jan 19, 2019

Thanks for the feedback, @rdblue.

If Iceberg should not impose case sensitivity on processing engines then it shouldn't impose case insensitivity either.

Agreed.

I'll go ahead and update the PR to:

  1. Update Binder#bind to accept a caseSensitive parameter. Update tests accordingly.
  2. Not change default behavior.

As suggested, I'll submit a follow up PR for using the new functionality.

@xabriel xabriel force-pushed the make-expression-binding-case-insensitive branch from 48407e3 to 606ab12 Compare January 19, 2019 02:31
@xabriel xabriel changed the title Make expression binding case insensitive Make expression binding support a case sensitivity flag Jan 19, 2019
@rdblue
Copy link
Contributor

rdblue commented Jan 22, 2019

Thanks for the update, @xabriel!

Adding a caseSensitive flag requires quite a few updates elsewhere, and method calls are less readable because of the boolean argument (we'll forget what true signals). What do you think about adding versions of these methods that don't have the flag and default to case sensitive?

The drawback is that it will be easy to not pay attention later and use case sensitive matching when it shouldn't be. But I think that would be okay. At least for tests, a package-private version that defaults to case sensitive would cut down on a lot of code changes.

What do you think?

@xabriel
Copy link
Contributor Author

xabriel commented Jan 22, 2019

What do you think about adding versions of these methods that don't have the flag and default to case sensitive?

I do think that that users of Binder#bind should be forced to make a choice, otherwise iceberg will evolve, get new committers, and some future PR will use default behavior from Binder#bind(StructType struct, Expression expr), thus inadvertently not honoring the new configuration flag that would come in a follow up PR to this one.

At least for tests, a package-private version that defaults to case sensitive would cut down on a lot of code changes.

A package-private version its ok with me if you think it will make this PR easier to follow. No problem.

So to summarize:

  1. We will have Binder#bind(StructType struct, Expression expr) as package-private, thus existing tests will change minimally.
  2. Binder#bind(StructType struct, Expression expr, boolean caseSensitive) to be used by all non-test callers.

Agreed?

@rdblue
Copy link
Contributor

rdblue commented Jan 23, 2019

@xabriel, sounds good to me. Thanks!

@rdblue rdblue merged commit 022fe36 into apache:master Jan 23, 2019
@rdblue
Copy link
Contributor

rdblue commented Jan 23, 2019

Merged. Thank you for contributing this, @xabriel!

renato2099 pushed a commit to renato2099/incubator-iceberg that referenced this pull request Jan 30, 2019
This doesn't change default behavior. Configuring case sensitivity for processing engines will be added in future commits.
rdblue referenced this pull request in rdblue/iceberg Apr 10, 2019
This doesn't change default behavior. Configuring case sensitivity for processing engines will be added in future commits.
rdblue referenced this pull request in rdblue/iceberg May 14, 2019
This doesn't change default behavior. Configuring case sensitivity for processing engines will be added in future commits.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants