Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34083][SQL] Using TPCDS original definitions for char/varchar columns #31012

Closed
wants to merge 3 commits into from
Closed

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Jan 4, 2021

What changes were proposed in this pull request?

This PR changes the column types in the table definitions of TPCDSBase from string to char and varchar, with respect to the original definitions for char/varchar columns in the official doc - TPC-DS_v2.9.0.

Why are the changes needed?

Comply with both TPCDS standard and ANSI, and using string will get wrong results with those TPCDS queries

Does this PR introduce any user-facing change?

no

How was this patch tested?

plan stability check

@github-actions github-actions bot added the SQL label Jan 4, 2021
@SparkQA
Copy link

SparkQA commented Jan 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38199/

@SparkQA
Copy link

SparkQA commented Jan 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38199/

@SparkQA
Copy link

SparkQA commented Jan 5, 2021

Test build #133610 has finished for PR 31012 at commit 4affca9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38310/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38310/

@SparkQA
Copy link

SparkQA commented Jan 6, 2021

Test build #133722 has finished for PR 31012 at commit 021d3bb.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38515/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38515/

@SparkQA
Copy link

SparkQA commented Jan 11, 2021

Test build #133928 has finished for PR 31012 at commit 7078573.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn yaooqinn changed the title [WIP][SQL] Using TPCDS original definitions for char/varchar columns [SPARK-34083][SQL] Using TPCDS original definitions for char/varchar columns Jan 12, 2021
@SparkQA
Copy link

SparkQA commented Jan 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38551/

@SparkQA
Copy link

SparkQA commented Jan 12, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38551/

@SparkQA
Copy link

SparkQA commented Jan 12, 2021

Test build #133964 has finished for PR 31012 at commit 1145e67.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Jan 13, 2021
…odegen in length check for char varchar

### What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

We can reduce more than 8000 bytes by removing the unnecessary CONCAT expression.

W/ this fix, for q41 in TPCDS with [Using TPCDS original definitions for char/varchar columns](#31012) applied, we can reduce the stage code-gen size from 22523 to 14369
```
14369  - 22523 = - 8154
```

### Why are the changes needed?

fix the perf regression(we need other improvements for q41 works), there will be a huge performance regression if codegen fails

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

modified uts

Closes #31150 from yaooqinn/SPARK-34086.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 14, 2021
…ils codegen in length check for char varchar

A backport for #31150 to branch 3.1

### What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

We can reduce more than 8000 bytes by removing the unnecessary CONCAT expression.

W/ this fix, for q41 in TPCDS with [Using TPCDS original definitions for char/varchar columns](#31012) applied, we can reduce the stage code-gen size from 22523 to 14369
```
14369  - 22523 = - 8154
```

### Why are the changes needed?

fix the perf regression(we need other improvements for q41 works), there will be a huge performance regression if codegen fails

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

modified uts

Closes #31168 from yaooqinn/SPARK-34086-31.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA
Copy link

SparkQA commented Jan 15, 2021

Test build #134121 has started for PR 31012 at commit 9d5ed31.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge change, but most seem like changes to golden result sets. Can you please further elaborate what the actual change is and why it's needed? I don't see any detail in this or the JIRA.

cloud-fan pushed a commit that referenced this pull request Jan 19, 2021
…gth check with StaticInvoke

### What changes were proposed in this pull request?

This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression.

here is a case from tpcds that could be fixed by this improvement
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generate 20K bytes, we are trying to reduce it to less than 8k
### Why are the changes needed?

performance improvement as in the PR benchmark test, the performance  w/ codegen is 2~3x better than w/o codegen.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

yes, it's a code reflect so the existing ut should be enough

cross-check with #31012 where the tpcds shall all pass

benchmark compared with master

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X

================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X

```

Closes #31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6fa2fb9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 19, 2021
…gth check with StaticInvoke

### What changes were proposed in this pull request?

This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression.

here is a case from tpcds that could be fixed by this improvement
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generate 20K bytes, we are trying to reduce it to less than 8k
### Why are the changes needed?

performance improvement as in the PR benchmark test, the performance  w/ codegen is 2~3x better than w/o codegen.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

yes, it's a code reflect so the existing ut should be enough

cross-check with #31012 where the tpcds shall all pass

benchmark compared with master

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X

================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X

```

Closes #31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA
Copy link

SparkQA commented Jan 19, 2021

Test build #134222 has started for PR 31012 at commit 5817a66.

@maropu
Copy link
Member

maropu commented Jan 20, 2021

IIUC our TPCDS queries refer to v1.4 and v2.7. These DDL stmts are the same with v2.9 ones?

@maropu
Copy link
Member

maropu commented Jan 20, 2021

Could you check the other DDL definition for ssb (SSBQuerySuite) and tpc-h (TPCHQuerySuite)? We need the same change there, too?

@SparkQA
Copy link

SparkQA commented Jan 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38876/

@SparkQA
Copy link

SparkQA commented Jan 20, 2021

Test build #134290 has finished for PR 31012 at commit 7502623.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn

This comment has been minimized.

@SparkQA
Copy link

SparkQA commented Jan 21, 2021

Test build #134330 has started for PR 31012 at commit 4635070.

@SparkQA
Copy link

SparkQA commented Jan 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38916/

@SparkQA
Copy link

SparkQA commented Jan 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38916/

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Test build #134486 has finished for PR 31012 at commit 77d2397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The plan change is padding string literals when comparing with char type columns.

@yaooqinn
Copy link
Member Author

I have updated the PR desc and the LOC is reduced from about 50K in the very first push to 500 with a series of PRs. The plans look OK to me now.
I'd like you guys to review this again. Thanks very much.

@yaooqinn yaooqinn requested a review from srowen January 26, 2021 15:12
@@ -418,7 +418,7 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
// If the alias name is different from attribute name, we can't strip it either, or we
// may accidentally change the output schema name of the root plan.
case a @ Alias(attr: Attribute, name)
if a.metadata == Metadata.empty &&
if (a.metadata == Metadata.empty || a.metadata == attr.metadata) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only non-testing change in this PR, but it's very minor and obvious.

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39092/

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super qualified to weigh in, but seems reasonable.

@@ -128,15 +128,15 @@ Input [9]: [ss_item_sk#1, ss_cdemo_sk#2, ss_store_sk#3, ss_quantity#4, ss_list_p
Output [4]: [cd_demo_sk#13, cd_gender#14, cd_marital_status#15, cd_education_status#16]
Batched: true
Location [not included in comparison]/{warehouse_dir}/customer_demographics]
PushedFilters: [IsNotNull(cd_gender), IsNotNull(cd_marital_status), IsNotNull(cd_education_status), EqualTo(cd_gender,F), EqualTo(cd_marital_status,D), EqualTo(cd_education_status,Primary), IsNotNull(cd_demo_sk)]
PushedFilters: [IsNotNull(cd_gender), IsNotNull(cd_marital_status), IsNotNull(cd_education_status), EqualTo(cd_gender,F), EqualTo(cd_marital_status,D), EqualTo(cd_education_status,Primary ), IsNotNull(cd_demo_sk)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal at all, just wondering why this now has whitespace? is it because these are now fixed-length strings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cd_education_status is char type, which means the data of it in the storage are always padded as fixed-length strings. While before it's string type and doesn't have this issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dsdgen tool would produce fixed-length strings for char values in the original data, so we should pad accordingly to match correctlly

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39092/

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Test build #134506 has finished for PR 31012 at commit 1aa22f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.1! (since it's mostly test-only)

@cloud-fan cloud-fan closed this in 5718d64 Jan 27, 2021
@cloud-fan
Copy link
Contributor

it has conflicts with 3.1, @yaooqinn can you create a backport PR?

skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…gth check with StaticInvoke

### What changes were proposed in this pull request?

This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression.

here is a case from tpcds that could be fixed by this improvement
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

The original case generate 20K bytes, we are trying to reduce it to less than 8k
### Why are the changes needed?

performance improvement as in the PR benchmark test, the performance  w/ codegen is 2~3x better than w/o codegen.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

yes, it's a code reflect so the existing ut should be enough

cross-check with apache#31012 where the tpcds shall all pass

benchmark compared with master

```logtalk
================================================================================================
Char Varchar Read Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 20, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 20                         1571           1667          83         63.6          15.7       1.0X
read char with length 20                           1710           1764          58         58.5          17.1       0.9X
read varchar with length 20                        1774           1792          16         56.4          17.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 40, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 40                         1824           1927          91         54.8          18.2       1.0X
read char with length 40                           1788           1928         137         55.9          17.9       1.0X
read varchar with length 40                        1676           1700          40         59.7          16.8       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 60, hasSpaces: false:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 60                         1727           1762          30         57.9          17.3       1.0X
read char with length 60                           1628           1674          43         61.4          16.3       1.1X
read varchar with length 60                        1651           1665          13         60.6          16.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 80, hasSpaces: true:     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 80                         1748           1778          28         57.2          17.5       1.0X
read char with length 80                           1673           1678           9         59.8          16.7       1.0X
read varchar with length 80                        1667           1684          27         60.0          16.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Read with length 100, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read string with length 100                        1709           1743          48         58.5          17.1       1.0X
read char with length 100                          1610           1664          67         62.1          16.1       1.1X
read varchar with length 100                       1614           1673          53         61.9          16.1       1.1X

================================================================================================
Char Varchar Write Side Perf
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 20, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 20                        2277           2327          67          4.4         227.7       1.0X
write char with length 20                          2421           2443          19          4.1         242.1       0.9X
write varchar with length 20                       2393           2419          27          4.2         239.3       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 40, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 40                        2249           2290          38          4.4         224.9       1.0X
write char with length 40                          2386           2444          57          4.2         238.6       0.9X
write varchar with length 40                       2397           2405          12          4.2         239.7       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 60, hasSpaces: false:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 60                        2326           2367          41          4.3         232.6       1.0X
write char with length 60                          2478           2501          37          4.0         247.8       0.9X
write varchar with length 60                       2475           2503          24          4.0         247.5       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 80, hasSpaces: true:    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 80                        9367           9773         354          1.1         936.7       1.0X
write char with length 80                         10454          10621         238          1.0        1045.4       0.9X
write varchar with length 80                      18943          19503         571          0.5        1894.3       0.5X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Write with length 100, hasSpaces: true:   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
write string with length 100                      11055          11104          59          0.9        1105.5       1.0X
write char with length 100                        12204          12275          63          0.8        1220.4       0.9X
write varchar with length 100                     21737          22275         574          0.5        2173.7       0.5X

```

Closes apache#31199 from yaooqinn/SPARK-34130.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…columns

### What changes were proposed in this pull request?

This PR changes the column types in the table definitions of `TPCDSBase` from string to char and varchar, with respect to the original definitions for char/varchar columns in the official doc - [TPC-DS_v2.9.0](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).

### Why are the changes needed?

Comply with both TPCDS standard and ANSI, and using string will get wrong results with those TPCDS queries

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

plan stability check

Closes apache#31012 from yaooqinn/tpcds.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|`d_fy_quarter_seq` INT,
|`d_fy_week_seq` INT,
|`d_day_name` CHAR(9),
|`d_quarter_name` CHAR(1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix this in myPR #31886.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do that separately becuase #31886 is about adding a new GitHub Action job?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, sure, @dongjoon-hyun ~

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dongjoon-hyun pushed a commit that referenced this pull request Mar 23, 2021
…me` in the TPCDS schema

### What changes were proposed in this pull request?

SPARK-34842 (#31012) has a typo in the type of `date_dim.d_quarter_name` in the TPCDS schema (`TPCDSBase`). This PR replace `CHAR(1)` with `CHAR(6)`. This fix comes from p28 in [the TPCDS official doc](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31943 from maropu/SPARK-34083-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Mar 23, 2021
…me` in the TPCDS schema

### What changes were proposed in this pull request?

SPARK-34842 (#31012) has a typo in the type of `date_dim.d_quarter_name` in the TPCDS schema (`TPCDSBase`). This PR replace `CHAR(1)` with `CHAR(6)`. This fix comes from p28 in [the TPCDS official doc](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31943 from maropu/SPARK-34083-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0494dc9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…me` in the TPCDS schema

### What changes were proposed in this pull request?

SPARK-34842 (apache#31012) has a typo in the type of `date_dim.d_quarter_name` in the TPCDS schema (`TPCDSBase`). This PR replace `CHAR(1)` with `CHAR(6)`. This fix comes from p28 in [the TPCDS official doc](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.9.0.pdf).

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#31943 from maropu/SPARK-34083-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0494dc9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants