From 0fad0ec79b1ddd7daf80eda54ccfc3daf20ab220 Mon Sep 17 00:00:00 2001
From: Jacky Li
Date: Sat, 3 Mar 2018 13:40:59 +0800
Subject: [PATCH 1/2] change

---
 docs/datamap/preaggregate-datamap-guide.md    | 51 ++++++++++++++++---
 docs/datamap/timeseries-datamap-guide.md      | 23 ++++++---
 .../examples/PreAggregateTableExample.scala   |  2 +
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/docs/datamap/preaggregate-datamap-guide.md b/docs/datamap/preaggregate-datamap-guide.md
index fabfd7d5927..115dd8224a7 100644
--- a/docs/datamap/preaggregate-datamap-guide.md
+++ b/docs/datamap/preaggregate-datamap-guide.md
@@ -1,5 +1,13 @@
 # CarbonData Pre-aggregate DataMap
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Pre-aggregate Table](#preaggregate-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Compaction](#compacting-pre-aggregate-tables)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
 ## Quick example
 Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
@@ -85,7 +93,35 @@ Start spark-shell in new terminal, type :paste, then copy and run the following
   spark.stop
 ```
-##PRE-AGGREGATE DataMap
+## DataMap Management
+A DataMap can be created using the following DDL:
+  ```
+  CREATE DATAMAP [IF NOT EXISTS] datamap_name
+  ON TABLE main_table
+  USING "datamap_provider"
+  DMPROPERTIES ('key'='value', ...)
+  AS
+    SELECT statement
+  ```
+The string following USING is called the DataMap Provider. In this version, CarbonData supports two
+kinds of DataMap:
+1. preaggregate, for pre-aggregate tables. No DMPROPERTY is required for this DataMap.
+2. timeseries, for timeseries roll-up tables. Please refer to [Timeseries DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/timeseries-datamap-guide.md).
+
+A DataMap can be dropped using the following DDL:
+  ```
+  DROP DATAMAP [IF EXISTS] datamap_name
+  ON TABLE main_table
+  ```
+To show all DataMaps created, use:
+  ```
+  SHOW DATAMAP
+  ON TABLE main_table
+  ```
+It will show all DataMaps created on the main table.
+
+
+## Preaggregate DataMap Introduction
 Pre-aggregate tables are created as DataMaps and managed as tables internally by CarbonData.
 Users can create as many pre-aggregate datamaps as required to improve query performance,
 provided the storage requirements and loading speeds are acceptable.
@@ -163,7 +199,7 @@ SELECT country, max(price) from sales GROUP BY country
 will query against main table **sales** only, because it does not satisfy pre-aggregate table selection logic.
-#### Loading data to pre-aggregate tables
+## Loading data
 For an existing table with loaded data, the data load to the pre-aggregate table will be triggered by the
 CREATE DATAMAP statement when the user creates the pre-aggregate table. For incremental loads after
 aggregate tables are created, loading data to the main table triggers the load to the pre-aggregate tables
@@ -174,7 +210,7 @@ meaning that data on main table and pre-aggregate tables are only visible to the
 tables are loaded successfully; if one of these loads fails, the new data is not visible in any of the
 tables, as if the load operation had not happened.
-#### Querying data from pre-aggregate tables
+## Querying data
 As a technique for query acceleration, pre-aggregate tables cannot be queried directly. Queries are to be
 made on the main table. During query planning, CarbonData will internally check the pre-aggregate tables
 associated with the main table and transform the query plan accordingly.
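To make the table selection logic above concrete, here is a minimal sketch that can be pasted into the spark-shell session from the Quick Example. The datamap name `agg_sales` and its `sum(price)` definition are assumptions for illustration; the `sales` table with `country` and `price` columns is the one used in the selection example above. The EXPLAIN output (discussed next) shows whether the plan was rewritten to use the pre-aggregate table.

```scala
// Sketch only: assumes a `sales` table with columns country and price, and uses
// a hypothetical datamap name agg_sales aggregating sum(price) by country.
spark.sql(
  s"""
     | CREATE DATAMAP agg_sales
     | ON TABLE sales
     | USING 'preaggregate'
     | AS SELECT country, sum(price) FROM sales GROUP BY country
   """.stripMargin)

// This query matches the datamap definition, so the planner can rewrite it to
// read the pre-aggregate table instead of scanning the whole sales table.
spark.sql("EXPLAIN SELECT country, sum(price) FROM sales GROUP BY country").show(false)

// max(price) is not covered by agg_sales, so this query falls back to the main table.
spark.sql("EXPLAIN SELECT country, max(price) FROM sales GROUP BY country").show(false)
```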
@@ -183,7 +219,8 @@ User can verify whether a query can leverage pre-aggregate table or not by execu
 command, which will show the transformed logical plan, and thus the user can check whether a
 pre-aggregate table is selected.
-#### Compacting pre-aggregate tables
+
+## Compacting pre-aggregate tables
 Running the Compaction command (`ALTER TABLE COMPACT`) on the main table will **not automatically**
 compact the pre-aggregate tables created on the main table. The user needs to run the Compaction
 command separately on each pre-aggregate table to compact them.
@@ -193,8 +230,10 @@ main table but not performed on pre-aggregate table, all queries still can benef
 pre-aggregate tables. To further improve the query performance, compaction on pre-aggregate tables
 can be triggered to merge the segments and files in the pre-aggregate tables.
-#### Data Management on pre-aggregate tables
-Once there is pre-aggregate table created on the main table, following command on the main table
+## Data Management with pre-aggregate tables
+In current implementation, data consistence need to maintained for both main table and pre-aggregate
+tables. Once a pre-aggregate table is created on the main table, the following commands on the main table
 are not supported:
 1. Data management commands: `UPDATE/DELETE/DELETE SEGMENT`.
 2. Schema management commands: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
diff --git a/docs/datamap/timeseries-datamap-guide.md b/docs/datamap/timeseries-datamap-guide.md
index ecd7234064f..886c16173ea 100644
--- a/docs/datamap/timeseries-datamap-guide.md
+++ b/docs/datamap/timeseries-datamap-guide.md
@@ -1,14 +1,25 @@
 # CarbonData Timeseries DataMap
-## Supporting timeseries data (Alpha feature in 1.3.0)
+* [Timeseries DataMap](#timeseries-datamap-introduction-alpha-feature-in-130)
+* [Compaction](#compacting-timeseries-datamap)
+* [Data Management](#data-management-on-timeseries-datamap)
+
+## Timeseries DataMap Introduction (Alpha feature in 1.3.0)
 Timeseries DataMap is a pre-aggregate table implementation based on the 'preaggregate' DataMap.
 The difference is that a timeseries DataMap has a built-in understanding of the time hierarchy and
 its levels: year, month, day, hour, minute, so that it supports automatic roll-up in the time
 dimension for queries.
+
+The data loading, querying and compaction commands and their behavior are the same as for the
+preaggregate DataMap. Please refer to
+[Pre-aggregate DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md)
+for more information.
-For instance, user can create multiple timeseries datamap on the main table which has a *event_time*
-column, one datamap for one time granularity. Then Carbondata can do automatic roll-up for queries
-on the main table.
+To use this datamap, the user can create multiple timeseries datamaps on the main table which has
+an *event_time* column, one datamap for each time granularity. Then CarbonData can do automatic
+roll-up for queries on the main table.
+
+For example, the statements below effectively create multiple pre-aggregate tables on the main table
+called **timeseries**
 ```
 CREATE DATAMAP agg_year
@@ -126,10 +137,10 @@ the future CarbonData release.
 * timeseries datamaps created for each level need to be dropped separately
-#### Compacting timeseries datamp
+## Compacting timeseries datamap
 Refer to the Compaction section in [preaggregation datamap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md). The same applies to the timeseries datamap.
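As an illustration of the compaction behavior just described, the sketch below triggers compaction on the main table and then, separately, on a datamap table. It assumes the **timeseries** main table and the `agg_year` datamap from the example above, and that the corresponding pre-aggregate table can be addressed by a child table name such as `timeseries_agg_year`; that name is an assumption, so check the datamaps actually present (for example with `SHOW DATAMAP ON TABLE timeseries`) before running it.

```scala
// Sketch only: compacting the main table does NOT automatically compact its datamap tables.
spark.sql("ALTER TABLE timeseries COMPACT 'minor'")

// Each timeseries/pre-aggregate datamap table must be compacted separately.
// The child table name below (main table name + datamap name) is assumed for illustration.
spark.sql("ALTER TABLE timeseries_agg_year COMPACT 'minor'")
```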
-#### Data Management on timeseries datamap
+## Data Management on timeseries datamap
 Refer to the Data Management section in [preaggregation datamap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md). The same applies to the timeseries datamap.
\ No newline at end of file
diff --git a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
index ace3dcc53e8..d6a410dca43 100644
--- a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
+++ b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
@@ -99,6 +99,8 @@ object PreAggregateTableExample {
       s"""create datamap preagg_count on table maintable using 'preaggregate' as
          | select name, count(*) from maintable group by name""".stripMargin)
 
+    spark.sql("show datamap on table maintable").show
+
     spark.sql(
       s"""
          | SELECT id,max(age)

From 568349948ee5c2a299f5e6522f5395e399734eee Mon Sep 17 00:00:00 2001
From: Jacky Li
Date: Sat, 3 Mar 2018 16:36:00 +0800
Subject: [PATCH 2/2] fix comment

---
 docs/datamap/preaggregate-datamap-guide.md                     | 2 +-
 .../apache/carbondata/examples/PreAggregateTableExample.scala  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/datamap/preaggregate-datamap-guide.md b/docs/datamap/preaggregate-datamap-guide.md
index 115dd8224a7..199f674c08a 100644
--- a/docs/datamap/preaggregate-datamap-guide.md
+++ b/docs/datamap/preaggregate-datamap-guide.md
@@ -231,7 +231,7 @@ pre-aggregate tables. To further improve the query performance, compaction on pr
 can be triggered to merge the segments and files in the pre-aggregate tables.
 
 ## Data Management with pre-aggregate tables
-In current implementation, data consistence need to maintained for both main table and pre-aggregate
+In the current implementation, data consistency needs to be maintained for both the main table and pre-aggregate
 tables. Once a pre-aggregate table is created on the main table, the following commands on the main table
 are not supported:
diff --git a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
index d6a410dca43..64ed52508a8 100644
--- a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
+++ b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
@@ -100,7 +100,7 @@ object PreAggregateTableExample {
        | select name, count(*) from maintable group by name""".stripMargin)
 
     spark.sql("show datamap on table maintable").show
-
+
     spark.sql(
       s"""
         | SELECT id,max(age)