Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34318][SQL][3.1] Dataset.colRegex should work with column names and qualifiers which contain newlines #31457

Closed
wants to merge 1 commit into from

Conversation

sarutak
Copy link
Member

@sarutak sarutak commented Feb 3, 2021

What changes were proposed in this pull request?

Backport of #31426 for the record.

This PR fixes an issue that Dataset.colRegex doesn't work with column names or qualifiers which contain newlines.
In the current master, if column names or qualifiers passed to colRegex contain newlines, it throws exception.

val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "test
_table.`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

Why are the changes needed?

Column names and qualifiers can contain newlines but colRegex can't work with them, so it's a bug.

Does this PR introduce any user-facing change?

Yes. users can pass column names and qualifiers even though they contain newlines.

How was this patch tested?

New test.

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39421/

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39421/

@srowen
Copy link
Member

srowen commented Feb 3, 2021

(Backport of #31426 for the record - might be good to put in the description)
Seems fine if it passes and is a backport.

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Test build #134834 has finished for PR 31457 at commit 7d0476d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions github-actions bot added the SQL label Feb 3, 2021
@sarutak
Copy link
Member Author

sarutak commented Feb 3, 2021

@srowen Thanks. I've updated.

@HyukjinKwon
Copy link
Member

Merged to branch-3.1.

HyukjinKwon pushed a commit that referenced this pull request Feb 4, 2021
…s and qualifiers which contain newlines

### What changes were proposed in this pull request?

Backport of #31426 for the record.

This PR fixes an issue that `Dataset.colRegex` doesn't work with column names or qualifiers which contain newlines.
In the current master, if column names or qualifiers passed to `colRegex` contain newlines, it throws exception.
```
val df = Seq(1, 2, 3).toDF("test\n_column").as("test\n_table")
val col1 = df.colRegex("`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided

val col2 = df.colRegex("test\n_table.`tes.*\n.*mn`")
org.apache.spark.sql.AnalysisException: Cannot resolve column name "test
_table.`tes.*
.*mn`" among (test
_column)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$resolveException(Dataset.scala:272)
  at org.apache.spark.sql.Dataset.$anonfun$resolve$1(Dataset.scala:263)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.colRegex(Dataset.scala:1407)
  ... 47 elided
```

### Why are the changes needed?

Column names and qualifiers can contain newlines but `colRegex` can't work with them, so it's a bug.

### Does this PR introduce _any_ user-facing change?

Yes. users can pass column names and qualifiers even though they contain newlines.

### How was this patch tested?

New test.

Closes #31457 from sarutak/SPARK-34318-branch-3.1.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon closed this Feb 4, 2021
@sarutak sarutak deleted the SPARK-34318-branch-3.1 branch June 4, 2021 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants