[CARBONDATA-4240]: Added missing properties on the configurations page
Why is this PR needed?
A few user-facing properties that were missing from the configurations page have been added.

What changes were proposed in this PR?
Addition of missing properties

Does this PR introduce any user interface change?
No

Is any new testcase added?
No

This Closes #4210
pratyakshsharma authored and akashrn5 committed Oct 28, 2021
1 parent 9dbd2a5 commit 7d94691deb3300624ce4b22c4563cb4b9da776fa
Showing 3 changed files with 30 additions and 7 deletions.
@@ -52,6 +52,11 @@ This section provides the details of all the configurations required for the Car
| carbon.trash.retention.days | 7 | This parameter specifies the number of days after which the timestamp based subdirectories are expired in the trash folder. Allowed Min value = 0, Allowed Max Value = 365 days|
| carbon.clean.file.force.allowed | false | This parameter specifies if the clean files operation with force option is allowed or not.|
| carbon.cdc.minmax.pruning.enabled | false | This parameter defines whether min/max pruning should be performed on the target table based on the source data. It is useful when data is not sparse across the target table, which results in better pruning.|
| carbon.blocklet.size | 64 MB | Carbondata file consists of blocklets which further consists of column pages. As per the latest V3 format, the default size of a blocklet is 64 MB. It is recommended not to change this value except for some specific use case. |
| carbon.date.format | yyyy-MM-dd | This property specifies the format which is used for parsing the incoming date field values. |
| carbon.lock.class | (none) | This specifies the implementation of ICarbonLock interface to be used for acquiring the locks in case of concurrent operations |
| carbon.data.file.version | V3 | This specifies carbondata file format version. Carbondata file format has evolved with time from V1 to V3 in terms of metadata storage and IO level pruning capabilities. You can find more details [here](https://carbondata.apache.org/file-structure-of-carbondata.html#carbondata-file-format). |
| spark.carbon.hive.schema.store | false | Carbondata currently supports 2 different types of metastores for storing schemas. This property specifies if Hive metastore is to be used for storing and retrieving table schemas |
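As a minimal sketch, the system-level properties above are plain `key=value` pairs in `carbon.properties`. The snippet below writes a few of the properties from this table; the output path in the current directory is an assumption (in a real deployment the file lives under `$SPARK_HOME/conf/`), and the chosen values simply echo the documented defaults.

```python
# Illustrative only: write a few of the documented properties into a
# carbon.properties file. Path and values are assumptions for the sketch.
props = {
    "carbon.blocklet.size": "64 MB",           # default blocklet size (V3 format)
    "carbon.date.format": "yyyy-MM-dd",        # format for incoming date fields
    "carbon.data.file.version": "V3",          # carbondata file format version
    "spark.carbon.hive.schema.store": "false", # use Hive metastore for schemas?
}

with open("carbon.properties", "w") as f:
    for key, value in props.items():
        f.write(f"{key}={value}\n")
```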

## Data Loading Configuration

@@ -70,6 +75,7 @@ This section provides the details of all the configurations required for the Car
| carbon.load.global.sort.partitions | 0 | The number of partitions to use when shuffling data for global sort. Default value 0 means to use same number of map tasks as reduce tasks. **NOTE:** In general, it is recommended to have 2-3 tasks per CPU core in your cluster. |
| carbon.sort.size | 100000 | Number of records to hold in memory to sort and write intermediate sort temp files. **NOTE:** Memory required for data loading will increase if you increase this value. Besides, each thread will cache this amount of records. The number of threads is configured by *carbon.number.of.cores.while.loading*. |
| carbon.options.bad.records.logger.enable | false | CarbonData can identify the records that are not conformant to schema and isolate them as bad records. Enabling this configuration will make CarbonData log such bad records. **NOTE:** If the input data contains many bad records, logging them will slow down the overall data loading throughput. The data load operation status would depend on the configuration in ***carbon.bad.records.action***. |
| carbon.options.bad.records.action | FAIL | This property has four types of bad record actions: FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it auto-corrects the data by storing the bad records as NULL. If set to REDIRECT then bad records are written to the raw CSV instead of being loaded. If set to IGNORE then bad records are neither loaded nor written to the raw CSV. If set to FAIL then data loading fails if any bad records are found. Also this property can be set at different levels - the first priority is given to load options with property name as `BAD_RECORDS_ACTION`. In case that is not present, `carbon.options.bad.records.action` is used and if this is also not present then system level property `carbon.bad.records.action` is considered. In case nothing is configured, default value of FAIL is used. |
| carbon.bad.records.action | FAIL | CarbonData in addition to identifying the bad records, can take certain actions on such data. This configuration can have four types of actions for bad records namely FORCE, REDIRECT, IGNORE and FAIL. If set to FORCE then it auto-corrects the data by storing the bad records as NULL. If set to REDIRECT then bad records are written to the raw CSV instead of being loaded. If set to IGNORE then bad records are neither loaded nor written to the raw CSV. If set to FAIL then data loading fails if any bad records are found. |
| carbon.options.is.empty.data.bad.record | false | Based on the business scenarios, empty("" or '' or ,,) data can be valid or invalid. This configuration controls how empty data should be treated by CarbonData. If false, then empty ("" or '' or ,,) data will not be considered as bad record and vice versa. |
| carbon.options.bad.record.path | (none) | Specifies the HDFS path where bad records are to be stored. By default the value is Null. This path must be configured by the user if ***carbon.options.bad.records.logger.enable*** is **true** or ***carbon.bad.records.action*** is **REDIRECT**. |
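The lookup priority described for the bad-records action can be sketched as a simple cascade. The function and argument names below are illustrative, not CarbonData API; only the priority order (load option, then session property, then system property, then the FAIL default) comes from the description above.

```python
# Sketch of the documented resolution priority for the bad-records action.
# Names are illustrative; this is not CarbonData's implementation.
def resolve_bad_records_action(load_options, session_props, system_props):
    """Resolve the effective action, highest priority first:
    load option BAD_RECORDS_ACTION > carbon.options.bad.records.action
    > carbon.bad.records.action > default FAIL."""
    return (load_options.get("BAD_RECORDS_ACTION")
            or session_props.get("carbon.options.bad.records.action")
            or system_props.get("carbon.bad.records.action")
            or "FAIL")

# With nothing configured anywhere, the default applies:
print(resolve_bad_records_action({}, {}, {}))  # FAIL
```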
@@ -93,12 +99,15 @@ This section provides the details of all the configurations required for the Car
| carbon.options.serialization.null.format | \N | Based on the business scenarios, some columns might need to be loaded with null values. As null value cannot be written in csv files, some special characters might be adopted to specify null values. This configuration can be used to specify the null values format in the data being loaded. |
| carbon.column.compressor | snappy | CarbonData will compress the column values using the compressor specified by this configuration. Currently CarbonData supports 'snappy', 'zstd' and 'gzip' compressors. |
| carbon.minmax.allowed.byte.count | 200 | CarbonData will write the min max values for string/varchar type columns using the byte count specified by this configuration. Max value is 1000 bytes (500 characters) and min value is 10 bytes (5 characters). **NOTE:** This property is useful for reducing the store size thereby improving the query performance but can lead to query degradation if the value is not configured properly. |
| carbon.merge.index.failure.throw.exception | true | It is used to configure whether or not merge index failure should result in data load failure also. |
| carbon.binary.decoder | None | Supports configurable decoding for loading. Two decoders are supported: base64 and hex. |
| carbon.local.dictionary.size.threshold.inmb | 4 | Size-based threshold for local dictionary in MB; maximum allowed size is 16 MB. |
| carbon.enable.bad.record.handling.for.insert | false | By default, disable the bad record and converter step during "insert into" |
| carbon.load.si.repair | true | By default, enable loading for failed segments in SI during load/insert command |
| carbon.si.repair.limit | (none) | Number of failed segments to be loaded in SI when repairing missing segments in SI; by default all the missing segments are loaded. Supports values from 0 to 2147483646. |
| carbon.complex.delimiter.level.1 | # | This delimiter is used for parsing complex data type columns. Level 1 delimiter splits the complex type data column in a row (eg., a\001b\001c --> Array = {a,b,c}). |
| carbon.complex.delimiter.level.2 | $ | This delimiter splits the complex type nested data column in a row. Applies level_1 delimiter & applies level_2 based on complex data type (eg., a\002b\001c\002d --> Array<Array> = {{a,b},{c,d}}). |
| carbon.complex.delimiter.level.3 | @ | This delimiter splits the complex type nested data column in a row. Applies level_1 delimiter, applies level_2 and then level_3 delimiter based on complex data type. Used in case of nested Complex Map type. (eg., 'a\003b\002b\003c\001aa\003bb\002cc\003dd' --> Array<Map> = {{a -> b, b -> c},{aa -> bb, cc -> dd}}). |
| carbon.complex.delimiter.level.4 | (none) | All the levels of delimiters are used for parsing complex data type columns. All the delimiters are applied depending on the complexity of the given data type. Level 4 delimiter will be used for parsing the complex values after level 3 delimiter has been applied already. |
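The level-1/level-2 examples above can be sketched as nested splits. The helper below is illustrative only (CarbonData's parser is more involved); it uses the internal `\001` and `\002` delimiters shown in the examples, not the `#`/`$` DDL-level defaults.

```python
# Illustrative sketch of level-1/level-2 complex-delimiter parsing.
LEVEL_1, LEVEL_2 = "\x01", "\x02"  # internal delimiters from the examples

def parse_nested_array(value):
    """Split on level 1 first, then level 2 inside each element,
    e.g. 'a\\002b\\001c\\002d' -> [['a', 'b'], ['c', 'd']]."""
    return [item.split(LEVEL_2) for item in value.split(LEVEL_1)]

# A flat array needs only the level-1 split: 'a\001b\001c' -> ['a', 'b', 'c']
print("a\x01b\x01c".split(LEVEL_1))          # ['a', 'b', 'c']
print(parse_nested_array("a\x02b\x01c\x02d"))  # [['a', 'b'], ['c', 'd']]
```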

## Compaction Configuration

@@ -113,12 +122,13 @@ This section provides the details of all the configurations required for the Car
| carbon.numberof.preserve.segments | 0 | If the user wants to preserve some number of segments from being compacted, this configuration can be set. Example: carbon.numberof.preserve.segments = 2 will always exclude the 2 latest segments from compaction. No segments are preserved by default. **NOTE:** This configuration is useful when the input data could be wrong due to environment scenarios. Preserving some of the latest segments from being compacted makes it easy to delete wrongly loaded segments. Once compacted, it becomes more difficult to determine the exact data to be deleted (except when data is incrementing according to time). |
| carbon.allowed.compaction.days | 0 | This configuration is used to control the number of recent segments that get compacted, ignoring the older ones. This configuration is in days. For example: if the configuration is 2, then only the segments loaded within the past 2 days will get merged; segments loaded earlier than 2 days will not be merged. This configuration is disabled by default. **NOTE:** This configuration is useful when a bulk of history data is loaded into carbondata and queries on this data are less frequent. In such cases, involving these segments in compaction increases resource consumption and overall compaction time. |
| carbon.enable.auto.load.merge | false | Compaction can be automatically triggered once data load completes. This ensures that the segments are merged in time and thus query time does not increase with the number of segments. This configuration enables compaction along with data loading. **NOTE:** Compaction will be triggered once the data load completes, but the data load status waits until the compaction is completed. Hence it might look like data loading time has increased, but that is not the case. Moreover, failure of compaction will not affect the data loading status: if the data load completed successfully, the status is updated and the segments are committed. A failure during data loading, however, will not trigger compaction, and an error is returned immediately. |
| carbon.enable.page.level.reader.in.compaction | false | Enabling page level reader for compaction reduces the memory usage while compacting more number of segments. It allows reading only page by page instead of reading whole blocklet to memory. **NOTE:** Please refer to [file-structure-of-carbondata](./file-structure-of-carbondata.md#carbondata-file-format) to understand the storage format of CarbonData and concepts of pages.|
| carbon.concurrent.compaction | true | Compaction of different tables can be executed concurrently. This configuration determines whether to compact all qualifying tables in parallel or not. **NOTE:** Compacting concurrently is a resource demanding operation and needs more resources there by affecting the query performance also. This configuration is **deprecated** and might be removed in future releases. |
| carbon.compaction.prefetch.enable | false | Compaction operation is similar to Query + data load where in data from qualifying segments are queried and data loading performed to generate a new single segment. This configuration determines whether to query ahead data from segments and feed it for data loading. **NOTE:** This configuration is disabled by default as it needs extra resources for querying extra data. Based on the memory availability on the cluster, user can enable it to improve compaction performance. |
| carbon.enable.range.compaction | true | To configure Range-based Compaction to be used or not for RANGE_COLUMN. If true after compaction also the data would be present in ranges. |
| carbon.si.segment.merge | false | Setting this to true degrades LOAD performance. When the number of small files increases for SI segments (this can happen as the number of columns will be less and we store position id and reference columns), the user can either set this to true, which will merge the data files for upcoming loads, or run the SI refresh command (REFRESH INDEX <index_table>), which does this job for all segments. |
| carbon.partition.data.on.tasklevel | false | When enabled, tasks launched for local sort partition load will be based on one node one task, and compaction will be performed at task level for a partition. Load performance might be degraded, because the number of tasks launched is equal to the number of nodes in case of local sort. For compaction, memory consumption will be less, as more tasks will be launched for a partition. |
| carbon.minor.compaction.size | (none) | Minor compaction originally worked based on the number of segments (by default 4). However in that scenario, there was no control over the size of segments to be compacted. This parameter was introduced to exclude segments whose size is greater than the configured threshold so that the overall IO and time taken decrease. |
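The size-based exclusion that `carbon.minor.compaction.size` describes can be sketched as a filter over candidate segments. Everything below is illustrative: the function name, the MB unit for sizes, and the "no threshold means all segments qualify" fallback are assumptions based on the description above, not CarbonData internals.

```python
# Illustrative sketch of size-based segment exclusion for minor compaction.
def segments_for_minor_compaction(segment_sizes_mb, threshold_mb=None):
    """Keep only segments at or below the configured size threshold.
    With no threshold configured, all segments qualify (assumed to match
    the original number-of-segments-only behaviour)."""
    if threshold_mb is None:
        return list(segment_sizes_mb)
    return [size for size in segment_sizes_mb if size <= threshold_mb]

# An oversized 800 MB segment is skipped under a 512 MB threshold:
print(segments_for_minor_compaction([120, 800, 40], threshold_mb=512))  # [120, 40]
```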

## Query Configuration

@@ -151,6 +161,16 @@ This section provides the details of all the configurations required for the Car
| carbon.partition.max.driver.lru.cache.size | -1 | Maximum memory **(in MB)** up to which the driver can cache partition metadata. Beyond this, least recently used data will be removed from the cache before loading a new set of values. |
| carbon.mapOrderPushDown.<db_name>_<table_name>.column | empty | If the order by column is a sort column, specify that sort column here to avoid ordering at the map task. |
| carbon.metacache.expiration.seconds | Long.MAX_VALUE | Expiration time **(in seconds)** for tableInfo cache in CarbonMetadata and tableModifiedTime in CarbonFileMetastore, after the time configured since last access to the cache entry, tableInfo and tableModifiedTime will be removed from each cache. Recent access will refresh the timer. Default value of Long.MAX_VALUE means the cache will not be expired by time. **NOTE:** At the time when cache is being expired, queries on the table may fail with NullPointerException. |
| is.driver.instance | false | This parameter decides if the LRU cache for storing indexes needs to be created on the driver. By default, it is created on executors. |
| carbon.input.metrics.update.interval | 500000 | This property determines the number of records queried after which input metrics are updated to spark. It can be set dynamically within spark session itself as well. |
| carbon.use.bitset.pipe.line | true | Carbondata has various optimizations for faster query execution. Setting this property acts like a catalyst for filter queries. If set to true, the bitset is passed from one filter to another, resulting in incremental filtering and improving overall performance |
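The eviction rule described for `carbon.metacache.expiration.seconds` can be sketched as a small cache: an entry expires once the time since its last access exceeds the configured value, and every successful access refreshes the timer. The class below is illustrative only, not CarbonMetadata's implementation.

```python
# Illustrative sketch of access-time-based cache expiration, as described
# for carbon.metacache.expiration.seconds. Not CarbonData internals.
import time

class ExpiringCache:
    def __init__(self, expiration_seconds):
        self.expiration = expiration_seconds
        self.entries = {}  # key -> (value, last_access)

    def put(self, key, value):
        self.entries[key] = (value, time.monotonic())

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, last_access = entry
        if time.monotonic() - last_access > self.expiration:
            del self.entries[key]  # expired since last access
            return None
        self.entries[key] = (value, time.monotonic())  # recent access refreshes the timer
        return value
```

A very large expiration value (the `Long.MAX_VALUE` default) effectively disables time-based eviction, matching the documented behaviour.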

## Index Configuration
| Parameter | Default Value | Description |
|--------------------------------------|---------------|---------------------------------------------------|
| carbon.lucene.index.stop.words | false | By default, lucene does not create index for stop words like 'is', 'the' etc. This flag is used to override this behaviour |
| carbon.load.dateformat.setlenient.enable | false | This property enables lenient parsing of timestamp/date data in the load flow if parsing fails with an invalid timestamp data error. For example: 1941-03-15 00:00:00 is a valid time in the Asia/Calcutta zone but is invalid and fails to parse in the Asia/Shanghai zone, where DST was observed and clocks were turned forward 1 hour to 1941-03-15 01:00:00. |
| carbon.indexserver.tempfolder.deletetime | 10800000 | This specifies the time period in milliseconds after which temp folder gets deleted from index server |
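The cleanup described for `carbon.indexserver.tempfolder.deletetime` amounts to pruning folders older than the configured period. The sketch below uses directory modification time as the age signal; the function name and that choice are assumptions for illustration, not how the index server is implemented.

```python
# Illustrative sketch: delete temp folders older than the configured
# retention period (milliseconds), as described above. Not CarbonData code.
import os
import shutil
import time

def prune_temp_folders(root, delete_time_ms=10_800_000):
    """Remove subdirectories of `root` whose mtime is older than the period."""
    cutoff = time.time() - delete_time_ms / 1000.0
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            shutil.rmtree(path)  # older than the retention period
```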

## Data Mutation Configuration
| Parameter | Default Value | Description |
@@ -237,6 +257,9 @@ RESET
| carbon.enable.index.server | To use index server for caching and pruning. This property can be used for a session or for a particular table with ***carbon.enable.index.server.<db_name>.<table_name>***. |
| carbon.reorder.filter | This property can be used to enable/disable filter reordering. Should be disabled only when the user has optimized the filter condition. |
| carbon.mapOrderPushDown.<db_name>_<table_name>.column | If order by column is in sort column, specify that sort column here to avoid ordering at map task . |
| carbon.load.dateformat.setlenient.enable | To enable parsing of timestamp/date data in load flow if the parsing fails with invalid timestamp data error. **NOTE** Refer to [Index Configuration](#index-configuration)#carbon.load.dateformat.setlenient.enable for detailed information. |
| carbon.minor.compaction.size | Puts an upper limit on the size of segments to be included for compaction. **NOTE** Refer to [Compaction Configuration](#compaction-configuration)#carbon.minor.compaction.size for detailed information. |
| carbon.input.metrics.update.interval | Determines the number of records queried after which input metrics are updated to Spark. **NOTE** Refer to [Query Configuration](#query-configuration)#carbon.input.metrics.update.interval for detailed information. |
**Examples:**

* Add or Update:
@@ -641,7 +641,7 @@ CarbonData DDL statements are documented here,which includes:
This function creates a new database. By default the database is created in the 'spark.sql.warehouse.dir' location, but you can also specify a custom location by configuring 'spark.sql.warehouse.dir'; the configuration 'carbon.storelocation' has been deprecated.
**Note:**
For simplicity, we recommend you remove the configuration of carbon.storelocation. If carbon.storelocation and spark.sql.warehouse.dir are configured to different paths, an exception will be thrown on CREATE DATABASE and DROP DATABASE to avoid an inconsistent database location.
```
@@ -259,7 +259,7 @@ carbon.sql(

3. Add the carbonlib folder path in the Spark classpath. (Edit `$SPARK_HOME/conf/spark-env.sh` file and modify the value of `SPARK_CLASSPATH` by appending `$SPARK_HOME/carbonlib/*` to the existing value)

4. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the carbondata related properties are configured in this file.

5. Repeat Step 2 to Step 5 in all the nodes of the cluster.

@@ -304,7 +304,7 @@ carbon.sql(

**NOTE**: Create the carbonlib folder if it does not exist inside the `$SPARK_HOME` path.

2. Copy the `./conf/carbon.properties.template` file from CarbonData repository to `$SPARK_HOME/conf/` folder and rename the file to `carbon.properties`. All the carbondata related properties are configured in this file.

3. Create `tar.gz` file of carbonlib folder and move it inside the carbonlib folder.
