[doc_zh_cn] add a commercial inventory DB and carbon data query performance comparison doc chinese doc to carbondata #3521

Closed
wants to merge 1 commit into from


MarvinLitt
Contributor

…e doc to carbondata
As discussed with likun, add the Chinese doc file path.

Be sure to do all of the following checklist to help us incorporate
your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Please provide details on
    - Whether new unit test cases have been added or why no new tests are required?
    - How it is tested? Please attach test report.
    - Is it a performance related change? Please attach the performance test report.
    - Any additional information to help reviewers in testing this change.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1227/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1237/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1246/

@MarvinLitt changed the title from "[doc_zh_cn] add sybase iq and carbon data query performance comparison doc chines…" to "[doc_zh_cn] add a commercial inventory DB and carbon data query performance comparison doc chinese doc to carbondata" on Dec 23, 2019
@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1254/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1275/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1264/

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1263/

@CarbonDataQA1

Build Success with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1273/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1284/

| Cluster | Description |
| ---------------- | --------------------------------------------------------- |
| A commercial columnar DB cluster | 3 nodes, SSD disks |
| Hadoop cluster | 2 namenodes, 6 datanodes, SATA disks; the query queue is allocated 1/6 of the resources |
Contributor

Are these two clusters using the same resources?

Contributor Author

Yes. CarbonData uses 1/6 of the cluster resources, which equals one machine, and the commercial DB query also uses one machine.

limitations under the License.
-->

## Query performance comparison: CarbonData as a replacement for a commercial columnar DB
Contributor

Suggested change
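## Query performance comparison: CarbonData as a replacement for a commercial columnar DB
## Query performance comparison between CarbonData and a commercial columnar DB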

Contributor Author

done




## 1.Cluster status comparison
Contributor

@jackylk Dec 28, 2019

Suggested change
## 1.Cluster status comparison
## 1. Test environment

Contributor Author

done


| Cluster | Description |
| ---------------- | --------------------------------------------------------- |
| A commercial columnar DB cluster | 3 nodes, SSD disks |
Contributor

Please mention the CPU and memory resources.

Contributor Author

done

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1344/

@CarbonDataQA1

Build Failed with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1353/

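| carbon.sort.file.buffer.size | 20 | Total buffer size used for each temporary intermediate file during merge-sort (read/write) operations. Unit: MB |
| carbon.sort.size | 500000 | Number of records sorted at a time during data loading. |
| Main Spark configuration | | |
| spark.sql.shuffle.partitions | 70 | |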
Contributor

Please also add brief descriptions for these configuration items.

Contributor Author

done
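
For readers who want to reproduce the settings quoted in this thread, the following is a minimal sketch that renders them as configuration text. It assumes the `carbon.*` keys are supplied in a Java-properties-style file (CarbonData conventionally reads a `carbon.properties` file) and that the Spark key is passed as a `--conf` flag. The helper names are hypothetical; only the three keys and their values come from the table under review.

```python
# Values copied from the configuration table under review.
CARBON_SETTINGS = {
    "carbon.sort.file.buffer.size": "20",  # MB, per the table's description
    "carbon.sort.size": "500000",          # records sorted at a time during load
}
SPARK_SETTINGS = {
    "spark.sql.shuffle.partitions": "70",
}


def render_properties(settings):
    """Render key=value lines in Java properties style (hypothetical helper)."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


def render_spark_conf_flags(settings):
    """Render settings as spark-submit style --conf arguments (hypothetical helper)."""
    flags = []
    for key, value in settings.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    print(render_properties(CARBON_SETTINGS))
    print(" ".join(render_spark_conf_flags(SPARK_SETTINGS)))
```

Keeping the values in one place like this makes it easier to state exactly which settings a benchmark run used.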


```Spark query SQL model:```

SELECT COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0) AS COLUMN_C , COALESCE(SUM(COLUMN_A), 0) AS COLUMN_A_A , COALESCE(SUM(COLUMN_B), 0) AS COLUMN_B_B , COALESCE(SUM(COLUMN_D), 0) + COALESCE(SUM(COLUMN_E), 0) AS COLUMN_F , COALESCE(SUM(COLUMN_D), 0) AS COLUMN_D_D , COALESCE(SUM(COLUMN_E), 0) AS COLUMN_E_E , (COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0)) * delta AS COLUMN_F , COALESCE(SUM(COLUMN_A), 0) * delta AS COLUMN_G , COALESCE(SUM(COLUMN_B), 0) * delta AS COLUMN_H , MT.`TEMP` AS `TEMP` FROM ( SELECT `COLUMN_1_A` AS COLUMN_A, `COLUMN_1_E` AS COLUMN_E, `COLUMN_1_B` AS COLUMN_B, `COLUMN_1_D` AS COLUMN_D, TABLE_A.`TEMP` AS `TEMP` FROM TABLE_B LEFT JOIN ( SELECT `COLUMN_CSI` AS `TEMP2` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END AS `TEMP` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY `COLUMN_CSI`, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END ) TABLE_A ON `COLUMN_CSI` = TABLE_A.`TEMP2` WHERE TABLE_A.NAME_TEMP IS NOT NULL AND `TIME` >= A AND `TIME` < B ) MT GROUP BY MT.`TEMP` ORDER BY COLUMN_C DESC LIMIT 5000
Contributor

Suggested change
```SQL
SELECT
  COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0) AS COLUMN_C,
  COALESCE(SUM(COLUMN_A), 0) AS COLUMN_A_A,
  COALESCE(SUM(COLUMN_B), 0) AS COLUMN_B_B,
  COALESCE(SUM(COLUMN_D), 0) + COALESCE(SUM(COLUMN_E), 0) AS COLUMN_F,
  COALESCE(SUM(COLUMN_D), 0) AS COLUMN_D_D,
  COALESCE(SUM(COLUMN_E), 0) AS COLUMN_E_E,
  (COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0)) * delta AS COLUMN_F,
  COALESCE(SUM(COLUMN_A), 0) * delta AS COLUMN_G,
  COALESCE(SUM(COLUMN_B), 0) * delta AS COLUMN_H,
  MT.`TEMP` AS `TEMP`
FROM (
  SELECT
    `COLUMN_1_A` AS COLUMN_A,
    `COLUMN_1_E` AS COLUMN_E,
    `COLUMN_1_B` AS COLUMN_B,
    `COLUMN_1_D` AS COLUMN_D,
    TABLE_A.`TEMP` AS `TEMP`
  FROM TABLE_B
  LEFT JOIN (
    SELECT
      `COLUMN_CSI` AS `TEMP2`,
      CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END AS `TEMP`,
      CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END AS NAME_TEMP
    FROM DIMENSION_TABLE
    GROUP BY
      `COLUMN_CSI`,
      CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END,
      CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END
  ) TABLE_A
    ON `COLUMN_CSI` = TABLE_A.`TEMP2`
  WHERE
    TABLE_A.NAME_TEMP IS NOT NULL
    AND `TIME` >= A
    AND `TIME` < B
) MT
GROUP BY MT.`TEMP`
ORDER BY COLUMN_C DESC
LIMIT 5000
```

Contributor Author

done


```Query SQL model for the commercial columnar DB:```

SELECT TOP 5000 SUM(COALESCE(COLUMN_A, 0)) + SUM(COALESCE(COLUMN_B, 0)) AS COLUMN_C , SUM(COALESCE(COLUMN_A, 0)) AS COLUMN_A_A , SUM(COALESCE(COLUMN_B, 0)) AS COLUMN_B_B , SUM(COALESCE(COLUMN_D, 0)) + SUM(COALESCE(COLUMN_E, 0)) AS COLUMN_F , SUM(COALESCE(COLUMN_D, 0)) AS COLUMN_D_D , SUM(COALESCE(COLUMN_E, 0)) AS COLUMN_E_E , (SUM(COALESCE(COLUMN_A, 0)) + SUM(COALESCE(COLUMN_B, 0))) * delta AS COLUMN_F , SUM(COALESCE(COLUMN_A, 0)) * delta AS COLUMN_G , SUM(COALESCE(COLUMN_B, 0)) * delta AS COLUMN_H , MT."TEMP" AS "TEMP", COUNT(1) OVER () AS countNum FROM ( SELECT COALESCE(SUM("COLUMN_1_A"), 0) AS COLUMN_A , COALESCE(SUM("COLUMN_1_B"), 0) AS COLUMN_B , COALESCE(SUM("COLUMN_1_E"), 0) AS COLUMN_E , COALESCE(SUM("COLUMN_1_D"), 0) AS COLUMN_D , TABLE_A."TEMP" AS "TEMP" FROM TABLE_B LEFT JOIN ( SELECT "COLUMN_CSI" AS "TEMP2" , CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_CSI" END AS "TEMP" , CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_NAME" END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY "COLUMN_CSI", CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_CSI" END, CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_NAME" END ) TABLE_A ON "COLUMN_CSI" = TABLE_A."TEMP2" WHERE TABLE_A.NAME_TEMP IS NOT NULL AND "TIME" < A AND "TIME" >= B GROUP BY TABLE_A."TEMP" ) MT GROUP BY MT."TEMP" ORDER BY COLUMN_C DESC
Contributor

I think this is not required.
Users can already see the SQL statement at line 47.

Contributor Author

done
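
One visible difference between the two statements in this conversation is where `COALESCE` sits relative to `SUM`: the Spark statement uses `COALESCE(SUM(col), 0)`, while the outer query of the commercial DB statement uses `SUM(COALESCE(col, 0))`. The sketch below checks that the two forms agree per group when a `GROUP BY` is present. It uses SQLite purely as a stand-in engine, and the table and column names (`t`, `grp`, `value`) are hypothetical; it says nothing about how CarbonData or the commercial DB execute the query.

```python
import sqlite3


def compare_null_handling(rows):
    """Return (grp, coalesce_of_sum, sum_of_coalesce) for each group.

    `rows` is a list of (grp, value) tuples; value may be None (SQL NULL).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (grp TEXT, value INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT grp,
               COALESCE(SUM(value), 0) AS coalesce_of_sum,
               SUM(COALESCE(value, 0)) AS sum_of_coalesce
        FROM t
        GROUP BY grp
        ORDER BY grp
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Group "a" mixes a number with NULL; group "b" contains only NULL.
    for row in compare_null_handling([("a", 1), ("a", None), ("b", None)]):
        print(row)
```

The two forms only diverge when an aggregate runs over zero rows without a `GROUP BY`, which is not the shape of the outer queries quoted here.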


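## 2.Introduction to the query SQL model

The query SQL of the commercial columnar DB differs from that of CarbonData, so the SQL must be modified before running the performance test.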
Contributor

Can be removed

Contributor Author

done
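
Two of the dialect differences between the statements quoted in this conversation are row limiting (`TOP 5000` versus `LIMIT 5000`) and identifier quoting (double quotes versus backticks). The following is a rough sketch of how those two rewrites could be automated. The function names are hypothetical, and the regular expressions cover only these two cases: they do not handle string literals containing double quotes, nested `TOP` clauses, or the other differences in the statements (such as `COUNT(1) OVER ()` or the inner `GROUP BY`).

```python
import re


def top_to_limit(sql):
    """Rewrite a leading `SELECT TOP n ...` into `SELECT ... LIMIT n`.

    Only a single TOP clause at the very start of the statement is handled.
    """
    match = re.match(r"\s*SELECT\s+TOP\s+(\d+)\s+", sql, flags=re.IGNORECASE)
    if match is None:
        return sql  # nothing to rewrite
    rest = sql[match.end():].rstrip().rstrip(";")
    return f"SELECT {rest} LIMIT {match.group(1)}"


def double_quotes_to_backticks(sql):
    """Replace "identifier" quoting with `identifier` quoting."""
    return re.sub(r'"([^"]+)"', r"`\1`", sql)


def to_spark_style(sql):
    """Apply both rewrites; a sketch, not a general SQL translator."""
    return top_to_limit(double_quotes_to_backticks(sql))


if __name__ == "__main__":
    example = (
        'SELECT TOP 5000 SUM("COLUMN_A") AS COLUMN_C FROM TABLE_B '
        'GROUP BY "TEMP" ORDER BY COLUMN_C DESC'
    )
    print(to_spark_style(example))
```

For a real migration a SQL parser would be more robust than regular expressions; the sketch only illustrates the kind of change involved.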

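| A commercial columnar DB cluster | 3 nodes, SSD disks |
| Hadoop cluster | 2 namenodes, 6 datanodes, SATA disks; the query queue is allocated 1/6 of the resources |

## 2.Introduction to the query SQL model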
Contributor

Suggested change
## 2.Introduction to the query SQL model
## 2. SQL test statements

Contributor Author

done

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1363/

@CarbonDataQA1

Build Success with Spark 2.1.0, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.1/1364/

@CarbonDataQA1

Build Failed with Spark 2.2.1, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.2/1374/

@CarbonDataQA1

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1385/

@asfgit closed this in d1dadd5 on Jan 3, 2020
@jackylk
Contributor

jackylk commented Jan 3, 2020

LGTM

Indhumathi27 pushed a commit to Indhumathi27/carbondata that referenced this pull request Jan 3, 2020
MarvinLitt added a commit to MarvinLitt/carbondata that referenced this pull request Jan 3, 2020