[SPARK-34318][SQL][3.1] Dataset.colRegex should work with column names and qualifiers which contain newlines #31457

sarutak · 2021-02-03T11:46:42Z

What changes were proposed in this pull request?

Backport of #31426 for the record.

This PR fixes an issue where Dataset.colRegex doesn't work with column names or qualifiers that contain newlines.
In the current master, if a column name or qualifier passed to colRegex contains a newline, colRegex throws an exception.

val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "test
_table.`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

Why are the changes needed?

Column names and qualifiers can contain newlines, but colRegex fails on them, so this is a bug.
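
One plausible reading of the stack traces above is that colRegex falls through to Dataset.resolve because the backtick-quoted pattern is never recognized once the input contains a newline. In Java/Scala regexes, `.` does not match line terminators unless DOTALL (`(?s)`) is enabled. The sketch below illustrates only that regex behavior in plain Scala; the object name, the two patterns, and the `extract` helper are hypothetical and are not copied from Spark's source.

```scala
import scala.util.matching.Regex

object DotAllSketch {
  // Hypothetical patterns for illustration only (assumption: not Spark's actual code).
  // Without DOTALL, `.` stops at line terminators.
  val quoted: Regex = "`(.+)`".r
  // With the inline (?s) flag, `.` also matches newlines.
  val quotedDotAll: Regex = "(?s)`(.+)`".r

  // Returns the text between the backticks when the whole input matches the pattern.
  def extract(pattern: Regex, input: String): Option[String] =
    input match {
      case pattern(inner) => Some(inner)
      case _              => None
    }

  def main(args: Array[String]): Unit = {
    val name = "`tes.*\n.*mn`"
    // The plain pattern cannot span the newline, so nothing is extracted.
    println(extract(quoted, name).isDefined)
    // The DOTALL pattern matches the whole string, newline included.
    println(extract(quotedDotAll, name).isDefined)
  }
}
```

Scala's regex extractors require the whole input to match, so a single newline inside the backticks is enough to make the first pattern fail, while the `(?s)` variant accepts it.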

Does this PR introduce any user-facing change?

Yes. Users can pass column names and qualifiers to colRegex even when they contain newlines.

How was this patch tested?

New test.

SparkQA · 2021-02-03T12:44:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39421/

SparkQA · 2021-02-03T12:49:05Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39421/

srowen · 2021-02-03T14:14:13Z

(Backport of #31426 for the record - might be good to put in the description)
Seems fine if it passes and is a backport.

SparkQA · 2021-02-03T16:34:47Z

Test build #134834 has finished for PR 31457 at commit 7d0476d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sarutak · 2021-02-03T17:55:45Z

@srowen Thanks. I've updated.

HyukjinKwon · 2021-02-04T01:52:23Z

Merged to branch-3.1.

Closes #31457 from sarutak/SPARK-34318-branch-3.1.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Backport SPARK-34318.

7d0476d

github-actions bot added the SQL label Feb 3, 2021

HyukjinKwon approved these changes Feb 4, 2021

HyukjinKwon closed this Feb 4, 2021

sarutak deleted the SPARK-34318-branch-3.1 branch June 4, 2021 20:41
