[CARBONDATA-2915] Reformat Documentation of CarbonData
1. Split our carbondata commands into DDL and DML.
2. Add Presto integration along with Spark into the quick start.
3. Add a master reference manual which lists all the commands supported in carbondata. This manual shall have links to the supported DDL and DML.
4. Add an introduction to carbondata covering architecture, design and supported features.
5. Merge the FAQ and troubleshooting documents into a single document.
6. Add a separate md file explaining to users how to navigate across our documentation.
7. Add a TOC (Table of Contents) to all the md files which have multiple sections.
8. Add the list of supported properties at the beginning of each DDL or DML so that the user knows all the properties that are supported.
9. Rewrite the configuration property descriptions to explain each property in a bit more detail and to highlight when to use it and any caveats.
10. Reorder the configuration properties table to group the properties feature-wise.
11. Correct the grammar and sentence structure.

This closes #2693
sraghunandan authored and chenliang613 committed Sep 7, 2018
1 parent 3894e1d commit 6e50c1c
Showing 37 changed files with 2,454 additions and 1,107 deletions.
8 changes: 4 additions & 4 deletions docs/configuration-parameters.md
@@ -16,7 +16,7 @@
-->

# Configuring CarbonData
This guide explains the configurations that can be used to tune CarbonData to achieve better performance. Some of the properties can be set dynamically and are explained in the section Dynamic Configuration In CarbonData Using SET-RESET. Most of the properties that control the internal settings have reasonable default values. They are listed below along with an explanation.
This guide explains the configurations that can be used to tune CarbonData to achieve better performance. Most of the properties that control the internal settings have reasonable default values. They are listed below along with an explanation.

* [System Configuration](#system-configuration)
* [Data Loading Configuration](#data-loading-configuration)
@@ -59,7 +59,7 @@ This section provides the details of all the configurations required for the Car
| carbon.bad.records.action | FAIL | CarbonData, in addition to identifying the bad records, can take certain actions on such data. This configuration can have four types of actions for bad records, namely FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it auto-corrects the data by storing the bad records as NULL. If set to REDIRECT then bad records are written to the raw CSV instead of being loaded. If set to IGNORE then bad records are neither loaded nor written to the raw CSV. If set to FAIL then data loading fails if any bad records are found. |
| carbon.options.is.empty.data.bad.record | false | Based on the business scenarios, empty ("" or '' or ,,) data can be valid or invalid. This configuration controls how empty data should be treated by CarbonData. If false, then empty ("" or '' or ,,) data will not be considered as a bad record, and vice versa. |
| carbon.options.bad.record.path | (none) | Specifies the HDFS path where bad records are to be stored. By default the value is Null. This path must be configured by the user if ***carbon.options.bad.records.logger.enable*** is **true** or ***carbon.bad.records.action*** is **REDIRECT**. |
| carbon.blockletgroup.size.in.mb | 64 | Please refer to [file-structure-of-carbondata](../file-structure-of-carbondata.md ) to understand the storage format of CarbonData. The data is read as a group of blocklets which are called blocklet groups. This parameter specifies the size of each blocklet group. A higher value results in better sequential IO access. The minimum value is 16MB; any value less than 16MB will be reset to the default value (64MB). **NOTE:** Configuring a higher value might lead to poor performance as an entire blocklet group will have to be read into memory before processing. For filter queries with limit, it is **not advisable** to have a bigger blocklet size. For aggregation queries which need to return a larger number of rows, a bigger blocklet size is advisable. |
| carbon.blockletgroup.size.in.mb | 64 | Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData. The data is read as a group of blocklets which are called blocklet groups. This parameter specifies the size of each blocklet group. A higher value results in better sequential IO access. The minimum value is 16MB; any value less than 16MB will be reset to the default value (64MB). **NOTE:** Configuring a higher value might lead to poor performance as an entire blocklet group will have to be read into memory before processing. For filter queries with limit, it is **not advisable** to have a bigger blocklet size. For aggregation queries which need to return a larger number of rows, a bigger blocklet size is advisable. |
| carbon.sort.file.write.buffer.size | 16384 | CarbonData sorts and writes data to intermediate files to limit the memory usage. This configuration determines the buffer size to be used for reading and writing such files. **NOTE:** This configuration is useful to tune IO and derive optimal performance. Based on the OS and underlying hard disk type, these values can significantly affect the overall performance. It is ideal to tune the buffer size to be equivalent to the IO buffer size of the OS. The recommended range is between 10240 and 10485760 bytes. |
| carbon.sort.intermediate.files.limit | 20 | CarbonData sorts and writes data to intermediate files to limit the memory usage. Before writing the target carbondata file, the data in these intermediate files needs to be sorted again so as to ensure that the entire data in the data load is sorted. This configuration determines the minimum number of intermediate files after which a merge sort is applied on them to sort the data. **NOTE:** Intermediate merging happens on a separate thread in the background. The number of threads used is determined by ***carbon.merge.sort.reader.thread***. Configuring a low value will cause more time to be spent in merging these intermediate merged files, which can cause more IO. Configuring a high value would prevent the idle threads from being used for intermediate sort merges. The recommended range is between 2 and 50. |
| carbon.csv.read.buffersize.byte | 1048576 | CarbonData uses Hadoop InputFormat to read the csv files. This configuration value is used to pass the buffer size as input for the Hadoop MR job when reading the csv files. This value is configured in bytes. **NOTE:** Refer to ***org.apache.hadoop.mapreduce.InputFormat*** documentation for additional information. |
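
For illustration, the bad records handling described above could be configured in the carbon.properties file as in the following sketch (the HDFS path and the chosen action are illustrative values, not defaults):

```
# Log bad records and redirect them to a raw CSV instead of failing the load
carbon.options.bad.records.logger.enable=true
carbon.bad.records.action=REDIRECT
# Illustrative path; must be configured when logging or REDIRECT is enabled
carbon.options.bad.record.path=hdfs://hacluster/data/badrecords
carbon.options.is.empty.data.bad.record=false
```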
@@ -70,7 +70,7 @@ This section provides the details of all the configurations required for the Car
| carbon.enable.calculate.size | true | **For Load Operation**: Setting this property calculates the size of the carbon data file (.carbondata) and carbon index file (.carbonindex) for every load and updates the table status file. **For Describe Formatted**: Setting this property calculates the total size of the carbon data files and carbon index files for the respective table and displays it in the describe formatted command. **NOTE:** This is useful to determine the overall size of the carbondata table and also get an idea of how the table is growing in order to take backup strategy decisions. |
| carbon.cutOffTimestamp | (none) | CarbonData has the capability to generate the Dictionary values for the timestamp columns from the data itself without the need to store the computed dictionary values. This configuration sets the start date for calculating the timestamp. Java counts the number of milliseconds from the start of "1970-01-01 00:00:00". This property is used to customize the start position, for example "2000-01-01 00:00:00". **NOTE:** The date must be in the form ***carbon.timestamp.format***. CarbonData supports storing data for up to 68 years. For example, if the cut-off time is 1970-01-01 05:30:00, then data up to 2038-01-01 05:30:00 will be supported by CarbonData. |
| carbon.timegranularity | SECOND | This configuration is used to specify the data granularity level such as DAY, HOUR, MINUTE, or SECOND. This helps to store more than 68 years of data into CarbonData. |
| carbon.use.local.dir | false | CarbonData during data loading, writes files to local temp directories before copying the files to HDFS. This configuration is used to specify whether CarbonData can write locally to the tmp directory of the container or to the YARN application directory. |
| carbon.use.local.dir | false | CarbonData, during data loading, writes files to local temp directories before copying the files to HDFS. This configuration is used to specify whether CarbonData can write locally to the tmp directory of the container or to the YARN application directory. |
| carbon.use.multiple.temp.dir | false | When multiple disks are present in the system, YARN is generally configured with multiple disks to be used as temp directories for managing the containers. This configuration specifies whether to use multiple YARN local directories during data loading for disk IO load balancing. Enable ***carbon.use.local.dir*** for this configuration to take effect. **NOTE:** Data loading is an IO intensive operation whose performance can be limited by the disk IO threshold, particularly during multi table concurrent data load. Configuring this parameter balances the disk IO across multiple disks, thereby improving the overall load performance. |
| carbon.sort.temp.compressor | (none) | CarbonData writes every ***carbon.sort.size*** number of records to intermediate temp files during data loading to ensure the memory footprint is within limits. These temporary files can be compressed when written in order to save storage space. This configuration specifies the name of the compressor to be used to compress the intermediate sort temp files during the sort procedure in data loading. The valid values are 'SNAPPY', 'GZIP', 'BZIP2', 'LZ4', 'ZSTD' and empty. By default it is empty, which means that CarbonData will not compress the sort temp files. **NOTE:** A compressor is useful if you encounter a disk bottleneck. Since the data needs to be compressed and decompressed, it involves additional CPU cycles, but this is compensated by the higher IO throughput due to less data being written to or read from the disks. |
| carbon.load.skewedDataOptimization.enabled | false | During data loading, CarbonData divides the number of blocks equally so as to ensure that all executors process the same number of blocks. This mechanism satisfies most of the scenarios and ensures maximum parallel processing for optimal data loading performance. In some business scenarios, the size of blocks may vary significantly and hence some executors would have to do more work if they get blocks containing more data. This configuration enables a size based block allocation strategy for data loading. When loading, carbondata will use a file size based block allocation strategy for task distribution. It will make sure that all the executors process the same size of data. **NOTE:** This configuration is useful if the size of your input data files varies widely, say 1MB~1GB. For this configuration to work effectively, knowing the data pattern and size is important and necessary. |
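
As a sketch, a data loading oriented tuning of these properties in carbon.properties might look like the following (the non-default values shown are illustrative, not recommendations):

```
# Defaults for the sort buffers, shown for completeness
carbon.sort.file.write.buffer.size=16384
carbon.sort.intermediate.files.limit=20
# Use the container's local directory and multiple YARN temp directories for IO balancing
carbon.use.local.dir=true
carbon.use.multiple.temp.dir=true
# Compress intermediate sort temp files when disk IO is the bottleneck
carbon.sort.temp.compressor=SNAPPY
# Enable size based block allocation when input file sizes vary widely
carbon.load.skewedDataOptimization.enabled=true
```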
@@ -107,7 +107,7 @@ This section provides the details of all the configurations required for the Car
| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some number of segments from being compacted, then this configuration can be set. Example: carbon.numberof.preserve.segments = 2 will always exclude the 2 latest segments from compaction. No segments are preserved by default. **NOTE:** This configuration is useful when there are chances of the input data being wrong due to environment scenarios. Preserving some of the latest segments from being compacted can help to easily delete the wrongly loaded segments. Once compacted, it becomes more difficult to determine the exact data to be deleted (except when data is incrementing according to time). |
| carbon.allowed.compaction.days | 0 | This configuration is used to control the number of recent segments that need to be compacted, ignoring the older ones. This configuration is in days. For example: if the configuration is 2, then only the segments which are loaded in the time frame of the past 2 days will get merged. Segments which are loaded earlier than 2 days will not be merged. This configuration is disabled by default. **NOTE:** This configuration is useful when a bulk of history data is loaded into carbondata and queries on this data are less frequent. In such cases, involving these segments in compaction will affect the resource consumption and increase the overall compaction time. |
| carbon.enable.auto.load.merge | false | Compaction can be automatically triggered once the data load completes. This ensures that the segments are merged in time and thus query times do not increase with an increase in segments. This configuration enables compaction along with data loading. **NOTE:** Compaction will be triggered once the data load completes, but the data load status waits till the compaction is completed. Hence it might look like the data loading time has increased, but that is not the case. Moreover, a failure of compaction will not affect the data loading status. If the data load had completed successfully, the status would be updated and the segments committed. However, a failure while data loading will not trigger compaction and an error is returned immediately. |
| carbon.enable.page.level.reader.in.compaction | true | Enabling the page level reader for compaction reduces the memory usage while compacting a larger number of segments. It allows reading only page by page instead of reading the whole blocklet into memory. **NOTE:** Please refer to [file-structure-of-carbondata](../file-structure-of-carbondata.md ) to understand the storage format of CarbonData and the concept of pages. |
| carbon.enable.page.level.reader.in.compaction | true | Enabling the page level reader for compaction reduces the memory usage while compacting a larger number of segments. It allows reading only page by page instead of reading the whole blocklet into memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and the concept of pages. |
| carbon.concurrent.compaction | true | Compaction of different tables can be executed concurrently. This configuration determines whether to compact all qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a resource demanding operation and needs more resources, thereby also affecting query performance. This configuration is **deprecated** and might be removed in future releases. |
| carbon.compaction.prefetch.enable | false | The compaction operation is similar to Query + data load, wherein data from qualifying segments is queried and data loading is performed to generate a new single segment. This configuration determines whether to query ahead data from segments and feed it for data loading. **NOTE:** This configuration is disabled by default as it needs extra resources for querying the extra data ahead. Based on the memory availability on the cluster, the user can enable it to improve compaction performance. |
| carbon.merge.index.in.segment | true | Each CarbonData file has a companion CarbonIndex file which maintains the metadata about the data. These CarbonIndex files are read and loaded into the driver and are used subsequently for pruning of data during queries. These CarbonIndex files are very small in size (a few KB) and are many in number. Reading many small files from HDFS is not efficient and leads to slow IO performance. Hence these CarbonIndex files belonging to a segment can be combined into a single file and read once, thereby increasing the IO throughput. This configuration enables merging all the CarbonIndex files into a single MergeIndex file upon data loading completion. **NOTE:** Reading a single big file is more efficient in HDFS and the IO throughput is very high. Due to this, the time needed to load the index files into memory when a query is received for the first time on that table is significantly reduced, thereby significantly reducing the delay in serving the first query. |
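
For instance, a compaction oriented configuration in carbon.properties might look like the sketch below (the preserve and days values simply reuse the examples from the table above):

```
# Trigger compaction automatically after every data load
carbon.enable.auto.load.merge=true
# Always keep the 2 latest segments out of compaction
carbon.numberof.preserve.segments=2
# Only compact segments loaded within the past 2 days
carbon.allowed.compaction.days=2
# Defaults, shown for completeness
carbon.enable.page.level.reader.in.compaction=true
carbon.merge.index.in.segment=true
```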
13 changes: 8 additions & 5 deletions docs/datamap-developer-guide.md
@@ -3,14 +3,17 @@
### Introduction
DataMap is a data structure that can be used to accelerate certain queries on a table. Different DataMaps can be implemented by developers.
Currently, there are 2 types of DataMap supported:
1. IndexDataMap: DataMap that leveraging index to accelerate filter query
2. MVDataMap: DataMap that leveraging Materialized View to accelerate olap style query, like SPJG query (select, predicate, join, groupby)
1. IndexDataMap: DataMap that leverages an index to accelerate filter queries
2. MVDataMap: DataMap that leverages a Materialized View to accelerate OLAP style queries, like SPJG queries (select, predicate, join, groupby)

### DataMap provider
When a user issues `CREATE DATAMAP dm ON TABLE main USING 'provider'`, the corresponding DataMapProvider implementation will be created and initialized.
Currently, the provider string can be:
1. preaggregate: one type of MVDataMap that do pre-aggregate of single table
2. timeseries: one type of MVDataMap that do pre-aggregate based on time dimension of the table
1. preaggregate: A type of MVDataMap that does pre-aggregation of a single table
2. timeseries: A type of MVDataMap that does pre-aggregation based on the time dimension of the table
3. class name of an IndexDataMapFactory implementation: Developers can implement a new type of IndexDataMap by extending IndexDataMapFactory

When a user issues `DROP DATAMAP dm ON TABLE main`, the corresponding DataMapProvider interface will be called.

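As a hedged illustration of these statements (the table name, datamap name and the aggregation query are hypothetical; the exact syntax per provider is described in the provider-specific guides):

```
CREATE DATAMAP agg_sales
ON TABLE sales
USING 'preaggregate'
AS SELECT country, sum(quantity) FROM sales GROUP BY country

DROP DATAMAP agg_sales ON TABLE sales
```
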
Details about [DataMap Management](./datamap/datamap-management.md#datamap-management) and supported [DSL](./datamap/datamap-management.md#overview) are documented [here](./datamap/datamap-management.md).

5 changes: 3 additions & 2 deletions docs/datamap/bloomfilter-datamap-guide.md
Expand Up @@ -73,7 +73,7 @@ For instance, main table called **datamap_test** which is defined as:
age int,
city string,
country string)
STORED BY 'carbondata'
STORED AS carbondata
TBLPROPERTIES('SORT_COLUMNS'='id')
```
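
For example, a BloomFilter datamap on this table might be created as in the sketch below (the datamap name is hypothetical; `INDEX_COLUMNS` is the DMPROPERTY used to specify the indexed columns, and additional bloom tuning DMPROPERTIES may also be supported):

```
CREATE DATAMAP dm_city
ON TABLE datamap_test
USING 'bloomfilter'
DMPROPERTIES ('INDEX_COLUMNS' = 'city,id')
```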

@@ -144,4 +144,5 @@ You can refer to the corresponding section in `CarbonData Lucene DataMap`.
+ In some scenarios, the BloomFilter datamap may not enhance the query performance significantly,
but if it can reduce the number of spark tasks,
there is still a chance that the BloomFilter datamap can enhance the performance for concurrent queries.
+ Note that the BloomFilter datamap will decrease the data loading performance and may cause slight storage expansion (for the datamap index file).

17 changes: 14 additions & 3 deletions docs/datamap/datamap-management.md
@@ -7,7 +7,7 @@
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -17,6 +17,18 @@

# CarbonData DataMap Management

- [Overview](#overview)
- [DataMap Management](#datamap-management)
- [Automatic Refresh](#automatic-refresh)
- [Manual Refresh](#manual-refresh)
- [DataMap Catalog](#datamap-catalog)
- [DataMap Related Commands](#datamap-related-commands)
- [Explain](#explain)
- [Show DataMap](#show-datamap)
- [Compaction on DataMap](#compaction-on-datamap)



## Overview

DataMap can be created using the following DDL
@@ -36,7 +48,7 @@ Currently, there are 5 DataMap implementations in CarbonData.
| DataMap Provider | Description | DMPROPERTIES | Management |
| ---------------- | ---------------------------------------- | ---------------------------------------- | ---------------- |
| preaggregate | single table pre-aggregate table | No DMPROPERTY is required | Automatic |
| timeseries | time dimension rollup table | event_time, xx_granularity, please refer to [Timeseries DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/timeseries-datamap-guide.md) | Automatic |
| timeseries | time dimension rollup table | event_time, xx_granularity, please refer to [Timeseries DataMap](./timeseries-datamap-guide.md) | Automatic |
| mv | multi-table pre-aggregate table | No DMPROPERTY is required | Manual |
| lucene | lucene indexing for text column | index_columns for specifying the index columns | Automatic |
| bloomfilter | bloom filter for high cardinality column, geospatial column | index_columns for specifying the index columns | Automatic |
@@ -49,7 +61,6 @@ There are two kinds of management semantics for DataMap.
2. Manual Refresh: Create datamap with `WITH DEFERRED REBUILD` in the statement

**CAUTION:**
Manual refresh currently only works fine for MV; it has some bugs with other types of datamap in Carbondata 1.4.1, so we block this option for them in this version.
If a user creates an MV datamap without specifying `WITH DEFERRED REBUILD`, carbondata will give a warning and treat the datamap as deferred rebuild.

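As an illustration of manual refresh, an MV datamap with deferred rebuild might be created and later refreshed as in the sketch below (the table, column and datamap names are hypothetical, and `REBUILD DATAMAP` is assumed here as the manual refresh command from the DataMap management DSL):

```
CREATE DATAMAP agg_sales_mv
ON TABLE sales
USING 'mv'
WITH DEFERRED REBUILD
AS SELECT country, sum(quantity) FROM sales GROUP BY country

-- Trigger the refresh manually once the base table data has been loaded
REBUILD DATAMAP agg_sales_mv
```
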
### Automatic Refresh
