
[SPARK-19912][SQL] String literals should be escaped for Hive metastore partition pruning #17266

Closed
wants to merge 1 commit

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Mar 12, 2017

What changes were proposed in this pull request?

Since the current `HiveShim`'s `convertFilters` does not escape string literals, there are correctness issues like the following. This PR aims to return the correct result and to show a clearer exception message.

**BEFORE**

```scala
scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")

scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
+---+
|  a|
+---+
+---+

scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
```

**AFTER**

```scala
scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
+---+
|  a|
+---+
|  2|
+---+

scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
```

How was this patch tested?

Pass the Jenkins test with new test cases.
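The quoting strategy of the fix can be sketched as a standalone helper. The name `quoteStringLiteral` matches the snippet under review, but this is a simplified sketch, not the exact patched code:

```scala
// Quote a partition-filter string literal with whichever quote character
// it does not contain. A literal containing both `"` and `'` cannot be
// expressed for the Hive metastore filter parser, so it is rejected.
object QuoteSketch {
  def quoteStringLiteral(str: String): String = {
    if (!str.contains("\"")) {
      s""""$str""""
    } else if (!str.contains("'")) {
      s"'$str'"
    } else {
      throw new UnsupportedOperationException(
        """Partition filter cannot have both `"` and `'` characters""")
    }
  }
}
```

For example, `quoteStringLiteral("p1\" and q=\"q1")` produces `'p1" and q="q1'`, which the metastore parses as a single literal instead of an injected filter clause.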

```scala
  s"""'$str'"""
} else {
  throw new UnsupportedOperationException(
    """Partition filter cannot have both `"` and `'` characters""")
```
Member Author

The current master also raises an exception in this mixed case.

```scala
scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
...
Caused by: java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:8 mismatched character '<EOF>' expecting '"'
```

Member

```scala
table.filter($"p" === """a"'b""").select($"a")
```

Member Author

Does that return the correct result?

Member Author

@dongjoon-hyun dongjoon-hyun Mar 12, 2017

From the current master branch, Hive does not accept that.

```scala
scala> Seq((1, "a\"'b")).toDF("a", "p").write.partitionBy("p").saveAsTable("t1")

scala> spark.table("t1").show()
+---+----+
|  a|   p|
+---+----+
|  1|a"'b|
+---+----+

scala> spark.table("t1").filter($"p" === """a"'b""").select($"a").show
java.lang.RuntimeException:
```

Member

We got the exception from Hive, because they are not escaped.

@SparkQA

SparkQA commented Mar 12, 2017

Test build #74400 has finished for PR 17266 at commit 37bff5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

We should add the escape, instead of adding quotes. Right?

@gatorsmile
Member

Based on Hive's doc (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types):

String literals can be expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping within the strings.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 12, 2017

Yep. I tried escaping first, but it doesn't work on the Hive side, I mean for the mixed cases.
Also, I tried the concatenation style `'"'"'"` for `"'` because it works in Hive. But for filter parsing, it fails at the Hive filter lexer layer.

@dongjoon-hyun
Member Author

BTW, I forgot to thank you. :) Thank you for the review.
For the non-mixed cases, I think we don't need to escape.

@gatorsmile
Member

Have you tried C-style escaping? `"'` -> `\"\'`

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 13, 2017

Yes. What I meant was that it is not supported correctly by Hive (as described in its documentation).
For the mixed case, none of the escaping combinations works as intended.

You can test it by changing the string here to the escaped one.
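For concreteness, C-style escaping along the lines discussed here could look like the following sketch (`escapeCStyle` is a hypothetical helper name, not code from this PR); per the discussion, the Hive metastore filter lexer does not accept such escaped literals in the mixed-quote case:

```scala
// Hypothetical C-style escaping helper, illustrating the alternative
// approach that was tried. The Hive metastore filter parser rejects its
// output for mixed-quote literals, which is why the PR quotes instead.
object EscapeSketch {
  def escapeCStyle(str: String): String = {
    val escaped = str
      .replace("\\", "\\\\") // escape backslashes first
      .replace("\"", "\\\"") // then double quotes
      .replace("'", "\\'")   // then single quotes
    "\"" + escaped + "\""
  }
}
```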

@gatorsmile
Member

Could you try Hive to double check it? Is this a bug in Hive?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 13, 2017

It's not a bug in the Hive CLI; it seems to be a limitation of that API, `getPartitionsByFilterMethod`.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 13, 2017

If you want, I will add some other failing test cases in this PR to make it clear for you and the others.
Maybe about `getPartitionsByFilterMethod`.

@gatorsmile
Member

Ok, how about submitting a separate PR that escapes the string? You can show the reviewers the failure cases there.

@dongjoon-hyun
Member Author

Yep. I see. Thanks for the guidance, @gatorsmile!

@dongjoon-hyun
Member Author

#17271 failed as expected. The Hive API does not handle filters with escaped strings, e.g. two escaped characters like `\"\"`.

@gatorsmile
Member

Could you investigate further the impact of the following two Hive JIRAs?

https://issues.apache.org/jira/browse/HIVE-11723
https://issues.apache.org/jira/browse/HIVE-2943

Thank you!

@dongjoon-hyun
Member Author

Sure!

@dongjoon-hyun
Member Author

dongjoon-hyun commented Mar 13, 2017

For HIVE-11723, it was resolved in `SemanticAnalyzer`.

I think it's possible to bring that into @JoshRosen's repo.

Let me backport it to see if that is enough.

@cloud-fan
Contributor

cloud-fan commented Mar 15, 2017

Can we say something more in the error message? We should explain that it's a Hive bug and include the Hive JIRA in it.

@dongjoon-hyun
Member Author

The following is the error message. Since we are not escaping in Spark master either, the behavior (incorrect filtering or this error message) is the same as on the master branch.

```
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:612)
...
Caused by: java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:8 mismatched character '<EOF>' expecting '"'
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:599)
  ... 103 more
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Error parsing partition filter : line 1:8 mismatched character '<EOF>' expecting '"'
  at org.apache.hadoop.hive.metastore.ObjectStore.getFilterParser(ObjectStore.java:2569)
  at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:2512)
  at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2335)
...
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:4442)
...
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1103)
...
  at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2254)
  ... 108 more
```

HIVE-11723 seems to resolve that in `SemanticAnalyzer`. So, I need to try that soon.

@dongjoon-hyun
Member Author

For the cases without an error message, the incorrect result is also a problem in this issue.

asfgit pushed a commit that referenced this pull request Mar 21, 2017
…re partition pruning

## What changes were proposed in this pull request?

Since the current `HiveShim`'s `convertFilters` does not escape string literals, there are correctness issues like the following. This PR aims to return the correct result and to show a clearer exception message.

**BEFORE**

```scala
scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")

scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
+---+
|  a|
+---+
+---+

scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
```

**AFTER**

```scala
scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
+---+
|  a|
+---+
|  2|
+---+

scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
```

## How was this patch tested?

Pass the Jenkins test with new test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #17266 from dongjoon-hyun/SPARK-19912.

(cherry picked from commit 21e366a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/2.1!

@asfgit asfgit closed this in 21e366a Mar 21, 2017
@dongjoon-hyun
Member Author

Oh, thank you, @cloud-fan !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-19912 branch January 7, 2019 07:03