[KYUUBI #1496] Support tpcds benchmark

### _Why are the changes needed?_
Support the TPC-DS benchmark in the `dev/kyuubi-tpcds` module.

Add a `README.md` to the `dev/kyuubi-tpcds` module that shows how to use it.

The main code is from [databricks-spark-sql-perf](https://github.com/databricks/spark-sql-perf).

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #1496 from ulysses-you/tpcds-benchmark.

Closes #1496

d4afe2d [ulysses-you] comment
54a146e [ulysses-you] pom
91e7169 [ulysses-you] docs
20eadc4 [ulysses-you] benchmark

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: ulysses-you <ulyssesyou@apache.org>
ulysses-you committed Dec 6, 2021
1 parent dad48c9 commit 37a4e5c
Showing 117 changed files with 7,419 additions and 563 deletions.
73 changes: 73 additions & 0 deletions dev/kyuubi-tpcds/README.md
@@ -0,0 +1,73 @@
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->

# Introduction
This module includes a TPC-DS data generator and benchmark.

# How to use

Package the jar with the following command:
`./build/mvn install -DskipTests -Ptpcds -pl dev/kyuubi-tpcds -am`

## data generator

Supported options:

| key | default | description |
|-------------|---------|------------------------------|
| db | default | the database to write data to |
| scaleFactor | 1 | the TPC-DS scale factor |

Example: the following command generates 10GB of data in a new database `tpcds_sf10`.

```shell
$SPARK_HOME/bin/spark-submit \
--class org.apache.kyuubi.tpcds.DataGenerator \
kyuubi-tpcds-*.jar --db tpcds_sf10 --scaleFactor 10
```

## do benchmark

Supported options:

| key | default | description |
|------------|----------------------|--------------------------------------------------------|
| db | none (required) | the TPC-DS database |
| benchmark | tpcds-v2.4-benchmark | the name of the application |
| iterations | 3 | the number of iterations to run |
| filter | a | filter on the name of the queries to run, e.g. q1-v2.4 |

Example: the following command runs the TPC-DS sf10 benchmark against the existing database `tpcds_sf10`.

```shell
$SPARK_HOME/bin/spark-submit \
--class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
kyuubi-tpcds-*.jar --db tpcds_sf10
```

Running a single TPC-DS query is also supported:
```shell
$SPARK_HOME/bin/spark-submit \
--class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
kyuubi-tpcds-*.jar --db tpcds_sf10 --filter q1-v2.4
```

The result of the TPC-DS benchmark looks like:

| name | minTimeMs | maxTimeMs | avgTimeMs | stdDev | stdDevPercent |
|---------|-----------|-------------|------------|----------|----------------|
| q1-v2.4 | 50.522384 | 868.010383 | 323.398267 | 471.6482 | 145.8413108576 |
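
To make the result columns concrete, here is a minimal sketch of how such a summary could be derived from per-iteration timings. It is not the module's implementation: the function name `summarize_timings` is hypothetical, and using the sample (n - 1) standard deviation is an assumption. `stdDevPercent` is taken to be the standard deviation as a percentage of the average.

```shell
# Hypothetical helper: reads one timing (in ms) per line from stdin and
# prints "min max avg stdDev stdDevPercent", each with two decimals.
summarize_timings() {
  awk '
    NF == 0 { next }                       # skip blank lines
    {
      v = $1 + 0                           # force numeric interpretation
      n++; t[n] = v; sum += v
      if (n == 1 || v < min) min = v
      if (n == 1 || v > max) max = v
    }
    END {
      if (n == 0) exit 1                   # no samples were given
      avg = sum / n
      for (i = 1; i <= n; i++) { d = t[i] - avg; ss += d * d }
      sd = (n > 1) ? sqrt(ss / (n - 1)) : 0        # sample standard deviation (assumed)
      pct = (avg != 0) ? sd / avg * 100 : 0        # stdDev relative to avg, in percent
      printf "%.2f %.2f %.2f %.2f %.2f\n", min, max, avg, sd, pct
    }'
}

# Three illustrative timings (not real benchmark output):
printf '%s\n' 100 200 300 | summarize_timings
# prints: 100.00 300.00 200.00 100.00 50.00
```

A large `stdDevPercent`, as in the `q1-v2.4` row above, usually means one iteration was much slower than the others (for example a cold first run), so raising `--iterations` may give a steadier average.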
29 changes: 29 additions & 0 deletions dev/kyuubi-tpcds/pom.xml
@@ -43,6 +43,22 @@
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<scope>provided</scope>
</dependency>

<dependency>
<groupId>com.github.scopt</groupId>
<artifactId>scopt_${scala.binary.version}</artifactId>
</dependency>

<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
</dependency>

<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<scope>provided</scope>
</dependency>
</dependencies>

<build>
@@ -57,6 +73,19 @@
<skipTests>true</skipTests>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
36 changes: 36 additions & 0 deletions dev/kyuubi-tpcds/src/main/resources/tpcds_2_4/q1.sql
@@ -0,0 +1,36 @@
--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements. See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--

--q1.sql--

WITH customer_total_return AS
(SELECT sr_customer_sk AS ctr_customer_sk, sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns, date_dim
WHERE sr_returned_date_sk = d_date_sk AND d_year = 2000
GROUP BY sr_customer_sk, sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1, store, customer
WHERE ctr1.ctr_total_return >
(SELECT avg(ctr_total_return)*1.2
FROM customer_total_return ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
AND s_store_sk = ctr1.ctr_store_sk
AND s_state = 'TN'
AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id LIMIT 100

64 changes: 64 additions & 0 deletions dev/kyuubi-tpcds/src/main/resources/tpcds_2_4/q10.sql
@@ -0,0 +1,64 @@
--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements. See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--

--q10.sql--

select
cd_gender, cd_marital_status, cd_education_status, count(*) cnt1,
cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3,
cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5,
cd_dep_college_count, count(*) cnt6
from
customer c, customer_address ca, customer_demographics
where
c.c_current_addr_sk = ca.ca_address_sk and
ca_county in ('Rush County','Toole County','Jefferson County',
'Dona Ana County','La Porte County') and
cd_demo_sk = c.c_current_cdemo_sk AND
exists (select * from store_sales, date_dim
where c.c_customer_sk = ss_customer_sk AND
ss_sold_date_sk = d_date_sk AND
d_year = 2002 AND
d_moy between 1 AND 1+3) AND
(exists (select * from web_sales, date_dim
where c.c_customer_sk = ws_bill_customer_sk AND
ws_sold_date_sk = d_date_sk AND
d_year = 2002 AND
d_moy between 1 AND 1+3) or
exists (select * from catalog_sales, date_dim
where c.c_customer_sk = cs_ship_customer_sk AND
cs_sold_date_sk = d_date_sk AND
d_year = 2002 AND
d_moy between 1 AND 1+3))
group by cd_gender,
cd_marital_status,
cd_education_status,
cd_purchase_estimate,
cd_credit_rating,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count
order by cd_gender,
cd_marital_status,
cd_education_status,
cd_purchase_estimate,
cd_credit_rating,
cd_dep_count,
cd_dep_employed_count,
cd_dep_college_count
LIMIT 100

90 changes: 90 additions & 0 deletions dev/kyuubi-tpcds/src/main/resources/tpcds_2_4/q11.sql
@@ -0,0 +1,90 @@
--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements. See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--

--q11.sql--

with year_total as (
select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(ss_ext_list_price-ss_ext_discount_amt) year_total
,'s' sale_type
from customer, store_sales, date_dim
where c_customer_sk = ss_customer_sk
and ss_sold_date_sk = d_date_sk
group by c_customer_id
,c_first_name
,c_last_name
,c_preferred_cust_flag
,c_birth_country
,c_login
,c_email_address
,d_year
union all
select c_customer_id customer_id
,c_first_name customer_first_name
,c_last_name customer_last_name
,c_preferred_cust_flag customer_preferred_cust_flag
,c_birth_country customer_birth_country
,c_login customer_login
,c_email_address customer_email_address
,d_year dyear
,sum(ws_ext_list_price-ws_ext_discount_amt) year_total
,'w' sale_type
from customer, web_sales, date_dim
where c_customer_sk = ws_bill_customer_sk
and ws_sold_date_sk = d_date_sk
group by
c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country,
c_login, c_email_address, d_year)
select
t_s_secyear.customer_id
,t_s_secyear.customer_first_name
,t_s_secyear.customer_last_name
,t_s_secyear.customer_preferred_cust_flag
from year_total t_s_firstyear
,year_total t_s_secyear
,year_total t_w_firstyear
,year_total t_w_secyear
where t_s_secyear.customer_id = t_s_firstyear.customer_id
and t_s_firstyear.customer_id = t_w_secyear.customer_id
and t_s_firstyear.customer_id = t_w_firstyear.customer_id
and t_s_firstyear.sale_type = 's'
and t_w_firstyear.sale_type = 'w'
and t_s_secyear.sale_type = 's'
and t_w_secyear.sale_type = 'w'
and t_s_firstyear.dyear = 2001
and t_s_secyear.dyear = 2001+1
and t_w_firstyear.dyear = 2001
and t_w_secyear.dyear = 2001+1
and t_s_firstyear.year_total > 0
and t_w_firstyear.year_total > 0
and case when t_w_firstyear.year_total > 0 then t_w_secyear.year_total / t_w_firstyear.year_total else 0.0 end
> case when t_s_firstyear.year_total > 0 then t_s_secyear.year_total / t_s_firstyear.year_total else 0.0 end
order by
t_s_secyear.customer_id
,t_s_secyear.customer_first_name
,t_s_secyear.customer_last_name
,
t_s_secyear.customer_preferred_cust_flag
LIMIT 100

38 changes: 38 additions & 0 deletions dev/kyuubi-tpcds/src/main/resources/tpcds_2_4/q12.sql
@@ -0,0 +1,38 @@
--
-- Licensed to the Apache Software Foundation (ASF) under one or more
-- contributor license agreements. See the NOTICE file distributed with
-- this work for additional information regarding copyright ownership.
-- The ASF licenses this file to You under the Apache License, Version 2.0
-- (the "License"); you may not use this file except in compliance with
-- the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing, software
-- distributed under the License is distributed on an "AS IS" BASIS,
-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- See the License for the specific language governing permissions and
-- limitations under the License.
--

--q12.sql--

select i_item_id,
i_item_desc, i_category, i_class, i_current_price,
sum(ws_ext_sales_price) as itemrevenue,
sum(ws_ext_sales_price)*100/sum(sum(ws_ext_sales_price)) over
(partition by i_class) as revenueratio
from
web_sales, item, date_dim
where
ws_item_sk = i_item_sk
and i_category in ('Sports', 'Books', 'Home')
and ws_sold_date_sk = d_date_sk
and d_date between cast('1999-02-22' as date)
and (cast('1999-02-22' as date) + interval '30' day)
group by
i_item_id, i_item_desc, i_category, i_class, i_current_price
order by
i_category, i_class, i_item_id, i_item_desc, revenueratio
LIMIT 100
