add a commercial inventory DB and carbon data query performance comparison doc chinese doc to carbondata
MarvinLitt committed Dec 30, 2019
1 parent fdcfcbf commit 533f8ed
Showing 1 changed file with 115 additions and 0 deletions.
115 changes: 115 additions & 0 deletions docs/zh_cn/CarbonData与商业列存DB性能对比.md
@@ -0,0 +1,115 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

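## CarbonData vs. a Commercial Columnar DB: Query Performance Comparison

This document shows the query performance improvement of CarbonData over a certain commercial columnar database, along with CarbonData's own strengths and characteristics. The data here are query results for SQL that follows the query patterns of one particular domain, and represent the performance comparison only for those patterns.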





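## 1. Cluster Configuration Comparison

During the tests, the commercial columnar DB used one query node with SSD disks. CarbonData ran on 6 DataNodes with SATA disks, but its query queue was allocated only 1/6 of the cluster resources, so the comparison is equivalent to one commercial DB server against one CarbonData server. The CarbonData servers also use SATA disks, which cost less than the commercial columnar DB servers.

| Cluster | Description |
| ------------------------------ | --------------------------------------------------------------------------------- |
| Commercial columnar DB cluster | 3 nodes, SSD disks |
| Hadoop cluster | 2 NameNodes, 6 DataNodes, SATA disks, query queue allocated 1/6 of the resources |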

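## 2. Query SQL Models

The query SQL differs between the commercial columnar DB and CarbonData, so the SQL has to be adapted before running the performance tests.

```Query SQL model for the commercial columnar DB:```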

SELECT TOP 5000 SUM(COALESCE(COLUMN_A, 0)) + SUM(COALESCE(COLUMN_B, 0)) AS COLUMN_C , SUM(COALESCE(COLUMN_A, 0)) AS COLUMN_A_A , SUM(COALESCE(COLUMN_B, 0)) AS COLUMN_B_B , SUM(COALESCE(COLUMN_D, 0)) + SUM(COALESCE(COLUMN_E, 0)) AS COLUMN_F , SUM(COALESCE(COLUMN_D, 0)) AS COLUMN_D_D , SUM(COALESCE(COLUMN_E, 0)) AS COLUMN_E_E , (SUM(COALESCE(COLUMN_A, 0)) + SUM(COALESCE(COLUMN_B, 0))) * delta AS COLUMN_F , SUM(COALESCE(COLUMN_A, 0)) * delta AS COLUMN_G , SUM(COALESCE(COLUMN_B, 0)) * delta AS COLUMN_H , MT."TEMP" AS "TEMP", COUNT(1) OVER () AS countNum FROM ( SELECT COALESCE(SUM("COLUMN_1_A"), 0) AS COLUMN_A , COALESCE(SUM("COLUMN_1_B"), 0) AS COLUMN_B , COALESCE(SUM("COLUMN_1_E"), 0) AS COLUMN_E , COALESCE(SUM("COLUMN_1_D"), 0) AS COLUMN_D , TABLE_A."TEMP" AS "TEMP" FROM TABLE_B LEFT JOIN ( SELECT "COLUMN_CSI" AS "TEMP2" , CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_CSI" END AS "TEMP" , CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_NAME" END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY "COLUMN_CSI", CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_CSI" END, CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_NAME" END ) TABLE_A ON "COLUMN_CSI" = TABLE_A."TEMP2" WHERE TABLE_A.NAME_TEMP IS NOT NULL AND "TIME" < A AND "TIME" >= B GROUP BY TABLE_A."TEMP" ) MT GROUP BY MT."TEMP" ORDER BY COLUMN_C DESC

In the rest of this document, each SUM is referred to as one counter.

```Query SQL model for Spark:```

SELECT COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0) AS COLUMN_C , COALESCE(SUM(COLUMN_A), 0) AS COLUMN_A_A , COALESCE(SUM(COLUMN_B), 0) AS COLUMN_B_B , COALESCE(SUM(COLUMN_D), 0) + COALESCE(SUM(COLUMN_E), 0) AS COLUMN_F , COALESCE(SUM(COLUMN_D), 0) AS COLUMN_D_D , COALESCE(SUM(COLUMN_E), 0) AS COLUMN_E_E , (COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0)) * delta AS COLUMN_F , COALESCE(SUM(COLUMN_A), 0) * delta AS COLUMN_G , COALESCE(SUM(COLUMN_B), 0) * delta AS COLUMN_H , MT.`TEMP` AS `TEMP` FROM ( SELECT `COLUMN_1_A` AS COLUMN_A, `COLUMN_1_E` AS COLUMN_E, `COLUMN_1_B` AS COLUMN_B, `COLUMN_1_D` AS COLUMN_D, TABLE_A.`TEMP` AS `TEMP` FROM TABLE_B LEFT JOIN ( SELECT `COLUMN_CSI` AS `TEMP2` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END AS `TEMP` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY `COLUMN_CSI`, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END ) TABLE_A ON `COLUMN_CSI` = TABLE_A.`TEMP2` WHERE TABLE_A.NAME_TEMP IS NOT NULL AND `TIME` >= A AND `TIME` < B ) MT GROUP BY MT.`TEMP` ORDER BY COLUMN_C DESC LIMIT 5000
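Comparing the two models, the visible dialect changes are that `TOP 5000` becomes a trailing `LIMIT 5000`, double-quoted identifiers become backtick-quoted, and the `COUNT(1) OVER ()` column is dropped. The following is only a minimal sketch of the first two rewrites, not the procedure used by the author. The function name `to_spark_sql` is hypothetical, and the sketch assumes the statement starts with `SELECT TOP n` and contains no string literals with double quotes.

```python
import re

# Matches a leading "SELECT TOP <n>" (assumption: TOP appears only on the outermost SELECT).
_TOP_RE = re.compile(r"^\s*SELECT\s+TOP\s+(\d+)\s+", re.IGNORECASE)
# Matches a double-quoted identifier such as "TEMP".
_QUOTED_IDENT_RE = re.compile(r'"([^"]+)"')


def to_spark_sql(sql: str) -> str:
    """Rewrite TOP n into LIMIT n and "ident" into `ident` (illustrative only)."""
    limit = None
    match = _TOP_RE.match(sql)
    if match:
        limit = match.group(1)
        sql = "SELECT " + sql[match.end():]
    # Spark SQL quotes identifiers with backticks rather than double quotes.
    sql = _QUOTED_IDENT_RE.sub(r"`\1`", sql)
    sql = sql.rstrip().rstrip(";")
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql


if __name__ == "__main__":
    print(to_spark_sql('SELECT TOP 5000 MT."TEMP" AS "TEMP" FROM MT ORDER BY COLUMN_C DESC'))
```

Semantic differences, such as where `COALESCE` is applied or whether the inner query pre-aggregates, still have to be reviewed by hand.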

## 3. Main CarbonData Configuration Parameters

```Main configuration```

| Main CarbonData configuration | Value | Description |
| ------------------------------------ | ------ | ------------------------------------------------------------ |
| carbon.inmemory.record.size | 480000 | Total number of rows per table to load into memory for a query. |
| carbon.number.of.cores | 4 | Number of threads that scan in parallel during a CarbonData query. |
| carbon.number.of.cores.while.loading | 15 | Number of parallel threads used during CarbonData data loading. |
| carbon.sort.file.buffer.size | 20 | Total buffer size used for each temporary intermediate file during merge sort (read/write) operations, in MB. |
| carbon.sort.size | 500000 | Number of records sorted at a time during data loading. |
| Main Spark configuration | | |
| spark.sql.shuffle.partitions | 70 | |
| spark.executor.instances | 6 | |
| spark.executor.cores | 13 | |
| spark.locality.wait | 0 | |
| spark.executor.memory | 5G | |
| spark.driver.cores | 3 | |
| spark.driver.memory | 50G | |
| spark.sql.codegen.wholeStage | True | |
| spark.sql.codegen.hugeMethodLimit | 8000 | |
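
As a rough illustration of how the values in the table might be applied, the sketch below renders the `carbon.*` entries as `key=value` lines (the format of a `carbon.properties` file) and the `spark.*` entries as `--conf` arguments for `spark-submit`. The function names are hypothetical, and how the author actually deployed these settings is not stated in this document.

```python
from typing import Dict, List

# Values copied from the table above.
CARBON_PROPERTIES: Dict[str, str] = {
    "carbon.inmemory.record.size": "480000",
    "carbon.number.of.cores": "4",
    "carbon.number.of.cores.while.loading": "15",
    "carbon.sort.file.buffer.size": "20",
    "carbon.sort.size": "500000",
}

SPARK_CONF: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "70",
    "spark.executor.instances": "6",
    "spark.executor.cores": "13",
    "spark.locality.wait": "0",
    "spark.executor.memory": "5G",
    "spark.driver.cores": "3",
    "spark.driver.memory": "50G",
    "spark.sql.codegen.wholeStage": "true",
    "spark.sql.codegen.hugeMethodLimit": "8000",
}


def render_carbon_properties(props: Dict[str, str]) -> str:
    """Render key=value lines, one property per line."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


def render_spark_submit_conf(conf: Dict[str, str]) -> List[str]:
    """Render each Spark setting as a '--conf key=value' argument pair."""
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(render_carbon_properties(CARBON_PROPERTIES), end="")
    print(" ".join(render_spark_submit_conf(SPARK_CONF)))
```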

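## 4. Query Performance Results at Different Data Scales

Query times for both the commercial columnar DB and CarbonData are averages over multiple runs.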
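
Averaging over repeated runs can be sketched as below. This is an assumption about the general approach, not the author's actual harness: `run_query` is a hypothetical callable that submits one query and blocks until the result is returned, and the default of 5 runs matches the column headers of the results table.

```python
import statistics
import time
from typing import Callable, List


def average_query_time(run_query: Callable[[], object], runs: int = 5) -> float:
    """Call run_query `runs` times and return the mean wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: executes the SQL and waits for the result
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a database.
    print(f"{average_query_time(lambda: sum(range(100_000))):.4f} s")
```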

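| Table category: data volume + number of counters | Table records (rows) | Number of counters | Commercial columnar DB: average time over 5 queries (s) | CarbonData: average time over 5 queries (s) |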
| ---------------------------- | -------------- | ------------ | ---------------------------------- | ------------------------------- |
| 100K_9Counter | 100K | 9Counter | 0.91 | 3.53 |
| 100K_18Counter | 100K | 18Counter | 1.30 | 3.81 |
| 100K_36Counter | 100K | 36Counter | 1.87 | 4.29 |
| 100K_72Counter | 100K | 72Counter | 3.82 | 5.09 |
| 500K_9Counter | 500K | 9Counter | 1.47 | 4.04 |
| 500K_18Counter | 500K | 18Counter | 1.98 | 4.61 |
| 500K_36Counter | 500K | 36Counter | 2.99 | 5.63 |
| 500K_72Counter | 500K | 72Counter | 5.67 | 7.53 |
| 1M_9Counter | 1M | 9Counter | 4.72 | 4.24 |
| 1M_18Counter | 1M | 18Counter | 5.13 | 4.84 |
| 1M_36Counter | 1M | 36Counter | 6.55 | 5.83 |
| 1M_72Counter | 1M | 72Counter | 10.83 | 7.90 |
| 5M_9Counter | 5M | 9Counter | 5.82 | 4.59 |
| 5M_18Counter | 5M | 18Counter | 7.70 | 5.26 |
| 5M_36Counter | 5M | 36Counter | 11.32 | 6.73 |
| 5M_72Counter | 5M | 72Counter | 21.78 | 9.27 |
| 10M_9Counter | 10M | 9Counter | 7.98 | 5.32 |
| 10M_18Counter | 10M | 18Counter | 11.39 | 6.03 |
| 10M_36Counter | 10M | 36Counter | 17.40 | 7.43 |
| 10M_72Counter | 10M | 72Counter | 34.50 | 10.48 |
| 50M_9Counter | 50M | 9Counter | 16.89 | 8.95 |
| 50M_18Counter | 50M | 18Counter | 25.50 | 10.42 |
| 50M_36Counter | 50M | 36Counter | 268.10 | 12.78 |
| 50M_72Counter | 50M | 72Counter | 554.16 | 18.79 |
| 100M_9Counter | 100M | 9Counter | 25.13 | 13.19 |
| 100M_18Counter | 100M | 18Counter | 35.57 | 14.87 |
| 100M_36Counter | 100M | 36Counter | 299.43 | 18.96 |
| 100M_72Counter | 100M | 72Counter | 678.72 | 28.12 |
| 1B_9Counter | 1B | 9Counter | 167.50 | 47.95 |
| 1B_18Counter | 1B | 18Counter | 261.20 | 55.79 |
| 1B_36Counter | 1B | 36Counter | 654.99 | 73.14 |
| 1B_72Counter | 1B | 72Counter | 1575.81 | 116.63 |
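
The relative performance for any row can be read as the ratio of the two average times. The sketch below computes that ratio for a few rows copied from the table; the `speedup` helper is a hypothetical name, and a value below 1 means the commercial columnar DB was faster for that row.

```python
from typing import List, Tuple

# (table category, commercial DB average seconds, CarbonData average seconds),
# copied from the table above.
ROWS: List[Tuple[str, float, float]] = [
    ("100K_9Counter", 0.91, 3.53),
    ("1M_9Counter", 4.72, 4.24),
    ("10M_72Counter", 34.50, 10.48),
    ("1B_72Counter", 1575.81, 116.63),
]


def speedup(commercial_s: float, carbondata_s: float) -> float:
    """Return how many times faster CarbonData is than the commercial DB."""
    return commercial_s / carbondata_s


if __name__ == "__main__":
    for category, commercial_s, carbondata_s in ROWS:
        print(f"{category}: {speedup(commercial_s, carbondata_s):.2f}x")
```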

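## 5. Summary

The results above show that for queries over tables of one million rows or more, CarbonData's query performance is clearly higher than that of the commercial columnar DB, while at 100K and 500K rows the commercial columnar DB is still faster. CarbonData's advantage becomes more pronounced as the data volume grows.

With equal or even inferior physical resources, CarbonData's query performance is higher than that of the commercial columnar DB, with an average query performance improvement of 1.5 to 9 times.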
