
[SPARK-45880][SQL] Make the like pattern semantics used in all commands consistent #43751

Open · wants to merge 19 commits into base: master
Conversation

@panbingkun (Contributor) commented Nov 10, 2023

What changes were proposed in this pull request?

This PR aims to:

  • Make the LIKE pattern semantics used in SQL SELECT ... FROM ... WHERE ... LIKE <pattern> and SHOW TABLE EXTENDED LIKE <pattern> consistent. The commands whose LIKE semantics change are:
    a. SHOW NAMESPACES ... LIKE
    b. SHOW TABLES ... LIKE
    c. SHOW TABLE EXTENDED ... LIKE
    d. SHOW VIEWS ... LIKE
    e. SHOW ... FUNCTIONS ... LIKE
    f. SHOW CATALOGS LIKE

  • Introduce a new TableCatalog.listTables overload that takes a pattern string for the v2 catalog (a rough sketch follows below).
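A rough sketch of the shape such an overload could take (illustrative only: TableCatalog is a Java interface in org.apache.spark.sql.connector.catalog, and the PR's actual signature and default behavior may differ):

    import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

    // Illustrative trait, not the actual interface change: an overload that accepts a
    // LIKE pattern so a catalog can push the filtering down to its metastore.
    trait TableCatalogWithPattern extends TableCatalog {
      // Fallback sketch: list everything, then apply the pattern on the Spark side.
      // The '%'/'_' handling is intentionally naive (no escaping of regex metacharacters).
      def listTables(namespace: Array[String], pattern: String): Array[Identifier] = {
        val regex = pattern.replace("%", ".*").replace("_", ".")
        listTables(namespace).filter(_.name.matches(regex))
      }
    }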

Why are the changes needed?

As discussed while implementing the ShowTablesExtended logic for V2 (#37588 (comment)), we need such an API in TableCatalog to make the code logic more cohesive.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Pass GA.
  • Manual tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@panbingkun (Contributor Author)

cc @cloud-fan

@panbingkun panbingkun marked this pull request as ready for review November 10, 2023 09:17
github-actions (bot)

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 19, 2024
* @param namespace a multi-part namespace
* @param pattern the filter pattern, only '*' and '|' are allowed as wildcards, others will
* follow regular expression convention, case-insensitive match and white spaces
* on both ends will be ignored
Contributor:

Do we have a doc page for the pattern string semantics? If we do, we should reference it here.

Contributor Author:

I searched the document and the only possible relationship is this one:
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-like.html#parameters

Perhaps we should explain it in detail here?
(PS: The first PR that introduced StringUtils.filterPattern is #12206.)

Contributor:

Yea, if they use the same implementation. The LIKE pattern doc does not even mention the *.

Contributor:

Another option is to document it in the SHOW TABLES doc page.

Contributor Author:

I have looked at the document https://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-tables.html#parameters (the SHOW TABLES doc page) and found that the regex_pattern parameter there explains the pattern.
Thank you very much for your reminder; let's refer to it.

@cloud-fan cloud-fan removed the Stale label Feb 19, 2024
* If the catalog supports views, this must return identifiers for only tables and not views.
*
* @param namespace a multi-part namespace
* @param pattern the filter pattern, only '*' and '|' are allowed as wildcards, others will
Contributor:

Not related to this PR, but the existing doc is a bit vague. '|' is not a wildcard, right? And '|' is also valid syntax in regex. Can we take a look at other databases and see how they document it?

Contributor Author:

Okay, let me investigate it.

@panbingkun (Contributor Author) commented Feb 20, 2024

@cloud-fan (Contributor)

I think it's more natural to follow the same behavior of the LIKE operator here. It seems all databases follow it (except for Hive before 4.0). Spark followed Hive at the beginning and that's probably why Spark has this special and weird behavior for the LIKE pattern in SHOW TABLES.

In fact, this is out of Spark's control, as it's the external catalog that applies the pattern string. We should follow the industry standard for defining the v2 catalog API. We should also update the SHOW TABLES doc page to mention the ideal behavior of the pattern string, as well as the legacy Hive behavior.

@panbingkun (Contributor Author)

I think it's more natural to follow the same behavior of the LIKE operator here. It seems all databases follow it (except for Hive before 4.0). Spark followed Hive at the beginning and that's probably why Spark has this special and weird behavior for the LIKE pattern in SHOW TABLES.

In fact, this is out of Spark's control, as it's the external catalog that applies the pattern string. We should follow the industry standard for defining the v2 catalog API. We should also update the SHOW TABLES doc page to mention the ideal behavior of the pattern string, as well as the legacy Hive behavior.

Okay, let me handle it in this PR and update the document.
Additionally, do we need to add a legacy configuration (defaulting to the new behavior) to determine whether the legacy behavior or the new behavior is used?
(PS: Yes, from the first PR we can see that the original author's intention was to respect the legacy Hive behavior.)

@cloud-fan (Contributor)

Yea let's add a legacy config.
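For reference, a minimal sketch of what such a legacy entry could look like in SQLConf (the key is the one discussed later in this thread; the field name, doc text, and version are illustrative, not the PR's actual code):

    // Inside org.apache.spark.sql.internal.SQLConf; hypothetical entry, not the PR's actual code.
    val LEGACY_USE_VERTICAL_BAR_AND_STAR_AS_WILDCARDS_IN_LIKE_PATTERN =
      buildConf("spark.sql.legacy.useVerticalBarAndStarAsWildcardsInLikePattern")
        .internal()
        .doc("When true, SHOW-style commands (SHOW TABLES, SHOW VIEWS, ...) interpret '*' " +
          "(any characters) and '|' (a choice) as wildcards in LIKE patterns, matching the " +
          "pre-Hive-4.0 behavior. When false, they follow SQL LIKE semantics with '%' and '_'.")
        .version("4.0.0")
        .booleanConf
        .createWithDefault(false)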

@panbingkun (Contributor Author) commented Feb 22, 2024

@cloud-fan

  • The commands that support the LIKE <pattern> syntax in Spark are as follows:
    a. SHOW namespaces ((FROM | IN) multipartIdentifier)? (LIKE? pattern=stringLit), e.g. SHOW NAMESPACES IN ns LIKE 'a*|b*'
    b. SHOW TABLES ((FROM | IN) identifierReference)? (LIKE? pattern=stringLit)?, e.g. SHOW TABLES IN DB LIKE 'a*|b*'
    c. SHOW TABLE EXTENDED ((FROM | IN) ns=identifierReference)? LIKE pattern=stringLit partitionSpec?, e.g. SHOW TABLE EXTENDED IN DB LIKE 'a*|b*'
    d. SHOW VIEWS ((FROM | IN) identifierReference)? (LIKE? pattern=stringLit)?, e.g. SHOW VIEWS IN DB LIKE 'a*|b*'
    e. SHOW identifier? FUNCTIONS ((FROM | IN) ns=identifierReference)? (LIKE? (legacy=multipartIdentifier | pattern=stringLit))?, e.g. SHOW FUNCTIONS LIKE 'a*|b*'
    f. SHOW CATALOGS (LIKE? pattern=stringLit)?, e.g. SHOW CATALOGS LIKE 'a*|b*'
  • If we only change the semantics of the wildcards supported by the pattern in the command SHOW TABLE EXTENDED ... LIKE <pattern> from (* and |) to (% and _), it would cause semantic inconsistency for users. Therefore, I suggest making similar changes to all the commands above.

  • In addition, since the maximum Hive version currently supported by Spark is 3.1.3, and in that version the wildcards supported by the pattern are still (* and |), supporting the (% and _) wildcards may require some workarounds. For example, in the API that gets databases by pattern, the current implementation logic is:

    override def getDatabasesByPattern(hive: Hive, pattern: String): Seq[String] = {
      recordHiveCall()
      hive.getDatabasesByPattern(pattern).asScala.toSeq
    }

    When supporting the new semantics, the logic may be (see the sketch after this list):
    Step 1: use hive.getAllDatabases to retrieve all databases.
    Step 2: perform pattern filtering on the Spark side over the above results.

    Is this acceptable? It may cause some performance loss, but there seems to be no better way to achieve it.
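A hedged sketch of what steps 1 and 2 could look like inside the Hive shim (hive.getAllDatabases is the client call mentioned in step 1; the pattern-to-regex translation is deliberately simplified and is not the PR's actual helper):

    import scala.jdk.CollectionConverters._
    import org.apache.hadoop.hive.ql.metadata.Hive

    // Sketch only: list every database, then apply SQL LIKE semantics on the Spark side.
    def getDatabasesByLikePattern(hive: Hive, pattern: String): Seq[String] = {
      // Step 1: fetch all databases from the metastore in a single call.
      val allDatabases = hive.getAllDatabases.asScala.toSeq
      // Step 2: translate '%' into ".*" and '_' into "." and filter locally
      // (simplified: real code should also escape regex metacharacters in the pattern).
      val regex = pattern.replace("%", ".*").replace("_", ".")
      allDatabases.filter(_.matches(regex))
    }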

…ns containing '%' for any character(s), and '_' for a single character
@panbingkun panbingkun changed the title [SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… [WIP][SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… Feb 26, 2024
@github-actions github-actions bot added the DOCS label Feb 27, 2024
@@ -40,12 +40,18 @@ SHOW TABLES [ { FROM | IN } database_name ] [ LIKE regex_pattern ]

* **regex_pattern**

Specifies the regular expression pattern that is used to filter out unwanted tables.
Specifies the regular expression pattern that is used to filter out unwanted tables.
Contributor Author:

Before/after screenshots of the rendered SHOW TABLES doc page (images not captured).

}
}

private[util] def likePatternToRegExp(pattern: String): String = {
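For context, a hypothetical sketch of what such a conversion can look like (illustrative only; the helper actually added by this PR may handle escaping and edge cases differently):

    import java.util.regex.Pattern

    // Hypothetical conversion from a SQL LIKE pattern to a Java regex string:
    // '%' matches any sequence of characters, '_' matches exactly one character,
    // and every other character is treated as a literal.
    def likePatternToRegExp(pattern: String): String = {
      val out = new StringBuilder
      pattern.foreach {
        case '%' => out.append(".*")
        case '_' => out.append(".")
        case c   => out.append(Pattern.quote(c.toString))
      }
      out.toString
    }

    // Example: "show_t1".matches(likePatternToRegExp("show_t%"))  // true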
+- ResolvedNamespace V2SessionCatalog(spark_catalog), [showdb]


-- !query
SHOW TABLES LIKE 'show_t1*|show_t2*'
Contributor Author:

The OR syntax represented by | is no longer supported by default.



-- !query
SHOW VIEWS LIKE 'view_1*|view_2*'
Contributor Author:

The OR syntax represented by | is no longer supported by default.

@cloud-fan (Contributor)

This is a hard decision. Technically, the behavior of LIKE in many commands (SHOW TABLES LIKE ...) relies on the underlying catalog, which can be an HMS of different versions, or a Hive-compatible metastore service. This is out of Spark's control.

From @panbingkun's investigation, the Hive behavior is actually very weird and different from other mainstream SQL systems (they follow the same behavior as the LIKE expression). Hive 4.0 also switches to the more common behavior.

There are some commands where we implement the LIKE filtering ourselves, following the Hive behavior. Now we are in a hard position:

  1. If we do nothing, then Spark's behavior of LIKE in commands is non-standard and different from other databases. We may also hit future behavior changes if we upgrade to Hive 4.0.
  2. If we change the LIKE filtering behavior now, it is a breaking change, and it also leads to inconsistent behavior, as some commands use Hive to do the LIKE filtering.

cc @srielau

QueryTest.checkAnswer(
  sql(s"SHOW USER FUNCTIONS IN $ns LIKE 'crc32i|date*'"),
  Seq("crc32i", "date1900", "Date1").map(testFun => Row(qualifiedFunName("ns", testFun))))
withSQLConf(
Contributor Author:

Since this UT tests '|', which is no longer supported in the new mode, we set the configuration spark.sql.legacy.useVerticalBarAndStarAsWildcardsInLikePattern to true to complete this test. In the future, we can consider removing this UT.

@@ -626,7 +626,12 @@ private[client] class Shim_v2_0 extends Shim with Logging {

  override def listFunctions(hive: Hive, db: String, pattern: String): Seq[String] = {
    recordHiveCall()
    hive.getFunctions(db, pattern).asScala.toSeq
    if (SQLConf.get.legacyUseStarAndVerticalBarAsWildcardsInLikePattern) {
Contributor Author:

This may cause performance loss, but there seems to be no better way.

@@ -57,54 +57,6 @@ public void close() throws HiveSQLException {
    cleanupOperationLog();
  }

  /**
Contributor Author:

The following method is extracted into the class MetadataOperationUtils and renamed to legacyXXX.

@@ -81,7 +87,7 @@ private[hive] class SparkGetFunctionsOperation(

    try {
      matchingDbs.foreach { db =>
        catalog.listFunctions(db, functionPattern).foreach {
        catalog.listFunctions(db, functionPattern).sortBy { item => item._1.funcName }.foreach {
Contributor Author:

This is to make the returned results more stable, since the underlying data structure is a HashMap (whose iteration order is not deterministic).

@@ -39,7 +39,7 @@ case class ShowTablesExec(

    val tables = catalog.listTables(namespace.toArray)
    tables.map { table =>
      if (pattern.map(StringUtils.filterPattern(Seq(table.name()), _).nonEmpty).getOrElse(true)) {
      if (pattern.forall(StringUtils.filterPattern(Seq(table.name()), _).nonEmpty)) {
Contributor Author:

This is just a cleanup suggested by the IDE; the two forms are equivalent (see the note below).
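For readers unfamiliar with the idiom: on Option, forall(p) returns true for None and p(x) for Some(x), so pattern.forall(p) behaves exactly like pattern.map(p).getOrElse(true). A tiny self-contained check (not from the PR):

    // Both forms return true when pattern is None, and p(value) when it is Some(value).
    val pattern: Option[String] = Some("show_t%")
    val before = pattern.map(p => p.nonEmpty).getOrElse(true)
    val after = pattern.forall(p => p.nonEmpty)
    assert(before == after)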

@@ -53,7 +53,7 @@ case class ShowNamespacesExec(

    val rows = new ArrayBuffer[InternalRow]()
    namespaceNames.map { ns =>
      if (pattern.map(StringUtils.filterPattern(Seq(ns), _).nonEmpty).getOrElse(true)) {
      if (pattern.forall(StringUtils.filterPattern(Seq(ns), _).nonEmpty)) {
Contributor Author:

This is only a correction made based on the syntax prompted by the IDE.

* @param names the names list to be filtered
* @param pattern the filter pattern, only '*' and '|' are allowed as wildcards, others will
* follow regular expression convention, case insensitive match and white spaces
* on both ends will be ignored
* @return the filtered names list in order
*/
def filterPattern(names: Seq[String], pattern: String): Seq[String] = {
def filterPatternLegacy(names: Seq[String], pattern: String): Seq[String] = {
Contributor Author:

This only renames the method (filterPattern to filterPatternLegacy).

@panbingkun (Contributor Author)

This is a hard decision. Technically, the behavior of LIKE in many commands (SHOW TABLES LIKE ...) relies on the underlying catalog, which can be an HMS of different versions, or a Hive-compatible metastore service. This is out of Spark's control.

From @panbingkun's investigation, the Hive behavior is actually very weird and different from other mainstream SQL systems (they follow the same behavior as the LIKE expression). Hive 4.0 also switches to the more common behavior.

There are some commands where we implement the LIKE filtering ourselves, following the Hive behavior. Now we are in a hard position:

  1. If we do nothing, then Spark's behavior of LIKE in commands is non-standard and different from other databases. We may also hit future behavior changes if we upgrade to Hive 4.0.
  2. If we change the LIKE filtering behavior now, it is a breaking change, and it also leads to inconsistent behavior, as some commands use Hive to do the LIKE filtering.

cc @srielau

After these efforts, all commands that support the LIKE <pattern> syntax have been changed. A configuration spark.sql.legacy.useVerticalBarAndStarAsWildcardsInLikePattern (default: false) has been added. When it is true, the wildcards supported by the LIKE pattern follow the semantics of Hive before 4.0 ('*' for any characters and '|' for a choice). When it is false, the behavior follows the semantics of SQL LIKE ('%' for any characters and '_' for a single character). The documentation has also been updated accordingly. A quick illustration follows below.
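An illustrative spark-shell session showing the two modes (table names and output are hypothetical; assumes an active SparkSession named spark):

    // Legacy semantics: '*' for any characters, '|' for a choice.
    spark.sql("SET spark.sql.legacy.useVerticalBarAndStarAsWildcardsInLikePattern=true")
    spark.sql("SHOW TABLES LIKE 'show_t1*|show_t2*'").show()

    // Default (new) semantics: '%' for any characters, '_' for a single character.
    spark.sql("SET spark.sql.legacy.useVerticalBarAndStarAsWildcardsInLikePattern=false")
    spark.sql("SHOW TABLES LIKE 'show_t%'").show()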

@panbingkun panbingkun changed the title [WIP][SPARK-45880][SQL] Introduce a new TableCatalog.listTable overload th… [SPARK-45880][SQL] Make the like pattern semantics used in all commands consistent Feb 28, 2024