[SPARK-33593][SQL][3.0] Vector reader got incorrect data with binary partition value #30839

AngersZhuuuu · 2020-12-18T10:09:34Z

What changes were proposed in this pull request?

Currently, when the Parquet vectorized reader is enabled, reading a partition column of binary type returns an incorrect value, as the UT below shows:

test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
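        if (tag) {
          // Vectorized reader enabled: the binary partition value comes back empty (the bug).
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          // Vectorized reader disabled: the partition value is read back correctly.
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }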
      }
    }
  })
}
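
As a sanity check on the test data, the hex literal X'537061726B2053514C' inserted into `id` decodes to the same text as the partition value, so both casts should yield "Spark SQL" when the reader behaves correctly. A minimal standalone sketch of that decoding follows; the names `HexLiteralCheck` and `hexToBytes` are hypothetical helpers for illustration and are not part of Spark or of this patch.

```scala
import java.nio.charset.StandardCharsets

object HexLiteralCheck {
  // Hypothetical helper: turn a hex string (two characters per byte) into raw bytes.
  def hexToBytes(hex: String): Array[Byte] =
    hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray

  def main(args: Array[String]): Unit = {
    // Same hex digits as the X'...' literal in the INSERT statement above.
    val bytes = hexToBytes("537061726B2053514C")
    val decoded = new String(bytes, StandardCharsets.UTF_8)
    println(decoded) // prints "Spark SQL"
  }
}
```

Since `id` and `part` hold the same nine bytes, any difference between the two casted columns in the query result points at how the partition value is read, not at the data written.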

Why are the changes needed?

Fixes a data correctness issue: with the vectorized reader enabled, a binary partition value is read back incorrectly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

SparkQA · 2020-12-18T10:58:49Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37609/

SparkQA · 2020-12-18T11:18:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37609/

SparkQA · 2020-12-18T13:46:18Z

Test build #133010 has finished for PR 30839 at commit 36335b8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-12-18T14:23:45Z

retest this please

SparkQA · 2020-12-18T15:06:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37620/

SparkQA · 2020-12-18T15:26:37Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37620/

dongjoon-hyun

+1, LGTM. Thank you, @AngersZhuuuu .
Merged to branch-3.0.

Closes #30839 from AngersZhuuuu/SPARK-33593-3.0.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

SparkQA · 2020-12-18T18:52:27Z

Test build #133021 has finished for PR 30839 at commit 36335b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value (36335b8)

AngersZhuuuu changed the title ~~[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value~~ [SPARK-33593][SQL][FOLLOW-UP][3.0] Vector reader got incorrect data with binary partition value Dec 18, 2020

dongjoon-hyun changed the title ~~[SPARK-33593][SQL][FOLLOW-UP][3.0] Vector reader got incorrect data with binary partition value~~ [SPARK-33593][SQL][3.0] Vector reader got incorrect data with binary partition value Dec 18, 2020

dongjoon-hyun approved these changes Dec 18, 2020

dongjoon-hyun closed this Dec 18, 2020
