New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-45880][SQL] Make the like pattern
semantics used in all commands consistent
#43751
base: master
Are you sure you want to change the base?
Conversation
…at takes a pattern string for v2 catalog
cc @cloud-fan |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
* @param namespace a multi-part namespace | ||
* @param pattern the filter pattern, only '*' and '|' are allowed as wildcards, others will | ||
* follow regular expression convention, case-insensitive match and white spaces | ||
* on both ends will be ignored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a doc page for the pattern string semantic? If we do we should reference it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I searched the document and the only possible relationship
is this one:
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-like.html#parameters
Perhaps we should explain it in detail here?
(PS: The first pr that introduces StringUtils.filterPattern
is: #12206)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea if they use the same implementation. The LIKE pattern doc does not even mention the *
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to document it in the SHOW TABLES doc page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* If the catalog supports views, this must return identifiers for only tables and not views. | ||
* | ||
* @param namespace a multi-part namespace | ||
* @param pattern the filter pattern, only '*' and '|' are allowed as wildcards, others will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR, but the existing doc is a bit vague. |
is not a wildcard, right? And |
is also a valid syntax in regex. Can we take a look at other databases and see how they document it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let me investigate it.
Just for making investigation records:
|
I think it's more natural to follow the same behavior of the LIKE operator here. It seems all databases follow it (except for Hive before 4.0). Spark followed Hive at the beginning and that's probably why Spark has this special and weird behavior for the LIKE pattern in SHOW TABLES. In fact, this is out of Spark's control, as it's the external catalog that applies the pattern string. We should follow the industry standard for defining the v2 catalog API. We should also update the SHOW TABLES doc page to mention the ideal behavior of the pattern string, as well as the legacy Hive behavior. |
Yea let's add a legacy config. |
|
…ns containing '%' for any character(s), and '_' for a single character
@@ -40,12 +40,18 @@ SHOW TABLES [ { FROM | IN } database_name ] [ LIKE regex_pattern ] | |||
|
|||
* **regex_pattern** | |||
|
|||
Specifies the regular expression pattern that is used to filter out unwanted tables. | |||
Specifies the regular expression pattern that is used to filter out unwanted tables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
} | ||
|
||
private[util] def likePatternToRegExp(pattern: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow the implementation logic of hive:
https://github.com/apache/hive/blob/89005659d1d8e167208ba4f9f9aaa2de7703229d/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L64-L86
+- ResolvedNamespace V2SessionCatalog(spark_catalog), [showdb] | ||
|
||
|
||
-- !query | ||
SHOW TABLES LIKE 'show_t1*|show_t2*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OR syntax represented by | is no longer supported by default.
|
||
|
||
-- !query | ||
SHOW VIEWS LIKE 'view_1*|view_2*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OR syntax represented by | is no longer supported by default.
This is a hard decision. Technically the behavior of LIKE in many commands ( From @panbingkun's investigation, the Hive behavior is actually very weird and different from other main-stream SQL systems (they follow the same behavior of the LIKE expression). Hive 4.0 also switches to the more common behavior. There are some commands that we implement the LIKE filtering by our own, following the Hive behavior. Now we are in a hard position:
cc @srielau |
QueryTest.checkAnswer( | ||
sql(s"SHOW USER FUNCTIONS IN $ns LIKE 'crc32i|date*'"), | ||
Seq("crc32i", "date1900", "Date1").map(testFun => Row(qualifiedFunName("ns", testFun)))) | ||
withSQLConf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that this UT is testing "|" and "|" is no longer supported in the new mode, we have set the configuration spark.sql.legacy.useVerticalBarAndStarAsWildcardsInLikePattern
to true to complete this test. In the future, we can consider removing this UT
@@ -626,7 +626,12 @@ private[client] class Shim_v2_0 extends Shim with Logging { | |||
|
|||
override def listFunctions(hive: Hive, db: String, pattern: String): Seq[String] = { | |||
recordHiveCall() | |||
hive.getFunctions(db, pattern).asScala.toSeq | |||
if (SQLConf.get.legacyUseStarAndVerticalBarAsWildcardsInLikePattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may cause performance loss
, but there seems to be no better
way.
@@ -57,54 +57,6 @@ public void close() throws HiveSQLException { | |||
cleanupOperationLog(); | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following method is extracted separately into class MetadataOperationUtils
and renamed as legacyXXX
@@ -81,7 +87,7 @@ private[hive] class SparkGetFunctionsOperation( | |||
|
|||
try { | |||
matchingDbs.foreach { db => | |||
catalog.listFunctions(db, functionPattern).foreach { | |||
catalog.listFunctions(db, functionPattern).sortBy { item => item._1.funcName }.foreach { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to make the returned results more stable
, as it contains a HashMap
data structure.
@@ -39,7 +39,7 @@ case class ShowTablesExec( | |||
|
|||
val tables = catalog.listTables(namespace.toArray) | |||
tables.map { table => | |||
if (pattern.map(StringUtils.filterPattern(Seq(table.name()), _).nonEmpty).getOrElse(true)) { | |||
if (pattern.forall(StringUtils.filterPattern(Seq(table.name()), _).nonEmpty)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a correction made based on the syntax prompted by the IDE
.
@@ -53,7 +53,7 @@ case class ShowNamespacesExec( | |||
|
|||
val rows = new ArrayBuffer[InternalRow]() | |||
namespaceNames.map { ns => | |||
if (pattern.map(StringUtils.filterPattern(Seq(ns), _).nonEmpty).getOrElse(true)) { | |||
if (pattern.forall(StringUtils.filterPattern(Seq(ns), _).nonEmpty)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only a correction made based on the syntax prompted by the IDE
.
* @param names the names list to be filtered | ||
* @param pattern the filter pattern, only '*' and '|' are allowed as wildcards, others will | ||
* follow regular expression convention, case insensitive match and white spaces | ||
* on both ends will be ignored | ||
* @return the filtered names list in order | ||
*/ | ||
def filterPattern(names: Seq[String], pattern: String): Seq[String] = { | ||
def filterPatternLegacy(names: Seq[String], pattern: String): Seq[String] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only rename XXX
to legacyXXX
.
After efforts, all commands that support syntax |
pattern
semantics used in all commands consistent
What changes were proposed in this pull request?
The pr aims to
make the
like pattern
semantics used in SQLSELECT ... FROM ... WHERE LIKE <pattern>
andSHOW TABLE EXTENDED LIKE <pattern>
consistent. The SQL for semantic change of like include:a. SHOW NAMESPACES ... LIKE
b. SHOW TABLES ... LIKE
c. SHOW TABLE EXTENDED ... LIKE
d. SHOW VIEWS ... LIKE
e. SHOW ... FUNCTIONS ... LIKE
f. SHOW CATALOGS LIKE
introduce a new TableCatalog.listTable overload that takes a pattern string for v2 catalog.
Why are the changes needed?
As we discussed in implementing the
ShowTablesExtended
logic in the V2, we need such an API inTableCatalog
to make the code logic more cohesive.#37588 (comment)
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No.