DRILL-7273: Introduce operators for handling metadata
closes #1886
vvysotskyi committed Nov 14, 2019
1 parent aba268a commit 7ab4c3739f34e9e96cbdfa883325a10311a8ef02
Showing 157 changed files with 11,739 additions and 1,008 deletions.
@@ -609,7 +609,7 @@ public static MajorType optional(final MinorType type) {
}

public static MajorType overrideMode(final MajorType originalMajorType, final DataMode overrideMode) {
return withPrecisionAndScale(originalMajorType.getMinorType(), overrideMode, originalMajorType.getPrecision(), originalMajorType.getScale());
return originalMajorType.toBuilder().setMode(overrideMode).build();
}

public static MajorType getMajorTypeFromName(final String typeName) {
@@ -17,13 +17,35 @@
*/
package org.apache.drill.common.util.function;

import java.util.function.Supplier;

import static org.apache.drill.common.exceptions.ErrorHelper.sneakyThrow;

/**
* The java standard library does not provide a lambda function interface for functions that take no arguments,
* but that throw an exception. So, we have to define our own here.
* @param <T> The return type of the lambda function.
* @param <E> The type of exception thrown by the lambda function.
*/
@FunctionalInterface
public interface CheckedSupplier<T, E extends Exception> {
T get() throws E;
public interface CheckedSupplier<T, E extends Exception> extends Supplier<T> {

@Override
default T get() {
try {
return getAndThrow();
} catch (Throwable e) {
sneakyThrow(e);
}
// should never happen
throw new RuntimeException();
}

/**
* Gets a result.
*
* @return a result
* @throws E exception in case of errors
*/
T getAndThrow() throws E;
}
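
The new interface lets a lambda that throws a checked exception be used wherever a plain `Supplier` is expected. As a self-contained sketch of the pattern (the `sneakyThrow` helper and the `CheckedSupplierDemo` class below are re-created for illustration; Drill's real helper lives in `org.apache.drill.common.exceptions.ErrorHelper`):

```java
import java.util.function.Supplier;

public class CheckedSupplierDemo {

  // Re-implementation of the sneaky-throw trick for this example: the
  // unchecked cast lets a checked exception escape without being declared.
  @SuppressWarnings("unchecked")
  static <E extends Throwable> void sneakyThrow(Throwable e) throws E {
    throw (E) e;
  }

  @FunctionalInterface
  interface CheckedSupplier<T, E extends Exception> extends Supplier<T> {

    @Override
    default T get() {
      try {
        return getAndThrow();
      } catch (Throwable e) {
        sneakyThrow(e);
      }
      // should never happen: sneakyThrow always throws
      throw new RuntimeException();
    }

    T getAndThrow() throws E;
  }

  public static void main(String[] args) {
    // The lambda declares IOException, yet satisfies plain Supplier<String>.
    CheckedSupplier<String, java.io.IOException> s = () -> "ok";
    Supplier<String> plain = s;
    System.out.println(plain.get());
  }
}
```

The subtlety is that `get()` re-throws the original checked exception unchanged (not wrapped), while still conforming to `Supplier.get()`, which declares no checked exceptions.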

Some generated files are not rendered by default. Learn more.

@@ -0,0 +1,119 @@
# Metastore ANALYZE commands

Drill provides the functionality to collect table metadata and store it in the Drill Metastore for later use.

Set the `metastore.enabled` option to `true` to enable Metastore usage.

To collect table metadata, use the following command:

```
ANALYZE TABLE [table_name] [COLUMNS (col1, col2, ...)]
REFRESH METADATA [partition LEVEL]
[{COMPUTE | ESTIMATE} STATISTICS [(column1, column2, ...)]]
[ SAMPLE numeric PERCENT ]
```
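
For illustration, assuming an existing `dfs.tmp.nation` Parquet table (the table name, columns, and level below are hypothetical), invocations might look like:

```sql
-- collect metadata for all columns down to the ROW_GROUP level
ANALYZE TABLE dfs.tmp.`nation` REFRESH METADATA 'ROW_GROUP' LEVEL;

-- collect metadata plus computed statistics for selected columns only
ANALYZE TABLE dfs.tmp.`nation` COLUMNS (n_nationkey, n_regionkey)
REFRESH METADATA COMPUTE STATISTICS;
```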

When this command is executed for the first time, metadata for the whole table is collected and stored in the
Metastore.
If the table was already analyzed and its data has not changed since, subsequent analyze commands will not
trigger re-analysis; a message that the table metadata is up to date is returned instead.

# Incremental analyze

When some table data has been updated, Drill attempts an incremental analyze: metadata is calculated only
for the updated data, and the rest of the required metadata is reused from the Metastore.

Incremental analyze is not performed in the following cases:
- the list of interesting columns specified in the analyze is not a subset of the interesting columns from the previous analyze;
- the specified metadata level differs from the metadata level of the previous analyze.
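
As a sketch (table and column names hypothetical), the second command below can proceed incrementally, while the third triggers a full analyze because its column list is not a subset of the previous one:

```sql
ANALYZE TABLE dfs.tmp.`nation` COLUMNS (n_nationkey, n_name) REFRESH METADATA;
-- ... new files appear under the table location ...
ANALYZE TABLE dfs.tmp.`nation` COLUMNS (n_nationkey, n_name) REFRESH METADATA; -- incremental
ANALYZE TABLE dfs.tmp.`nation` COLUMNS (n_comment) REFRESH METADATA;           -- full re-analyze
```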

# Metadata usage

Drill provides the ability to use metadata obtained from the Metastore at the planning stage to prune segments, files
and row groups.

Table metadata from the Metastore is exposed through the `INFORMATION_SCHEMA` tables (if Metastore usage is enabled).

The following tables are populated with table metadata from the Metastore:

`TABLES` table has the following additional columns populated from the Metastore:
- `TABLE_SOURCE` - table data type: `PARQUET`, `CSV`, `JSON`
- `LOCATION` - table location: `/tmp/nation`
- `NUM_ROWS` - number of rows in a table if known, `null` if not known
- `LAST_MODIFIED_TIME` - table's last modification time

`COLUMNS` table has the following additional columns populated from the Metastore:
- `COLUMN_DEFAULT` - column default value
- `COLUMN_FORMAT` - usually applicable for date time columns: `yyyy-MM-dd`
- `NUM_NULLS` - number of nulls in column values
- `MIN_VAL` - column min value in String representation: `aaa`
- `MAX_VAL` - column max value in String representation: `zzz`
- `NDV` - number of distinct values in the column, returned as a `DOUBLE`
- `EST_NUM_NON_NULLS` - estimated number of non-null values, returned as a `DOUBLE`
- `IS_NESTED` - whether the column is nested. Nested columns are extracted from columns of struct type.

`PARTITIONS` table has the following additional columns populated from the Metastore:
- `TABLE_CATALOG` - table catalog (currently we have only one catalog): `DRILL`
- `TABLE_SCHEMA` - table schema: `dfs.tmp`
- `TABLE_NAME` - table name: `nation`
- `METADATA_KEY` - top level segment key, the same for all nested segments and partitions: `part_int=3`
- `METADATA_TYPE` - `SEGMENT` or `PARTITION`
- `METADATA_IDENTIFIER` - current metadata identifier: `part_int=3/part_varchar=g`
- `PARTITION_COLUMN` - partition column name: `part_varchar`
- `PARTITION_VALUE` - partition column value: `g`
- `LOCATION` - segment location, `null` for partitions: `/tmp/nation/part_int=3`
- `LAST_MODIFIED_TIME` - last modification time
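
Once Metastore usage is enabled and a table has been analyzed, this metadata can be queried like any other `INFORMATION_SCHEMA` table; a hypothetical example:

```sql
SELECT METADATA_KEY, METADATA_TYPE, PARTITION_COLUMN, PARTITION_VALUE, LOCATION
FROM INFORMATION_SCHEMA.`PARTITIONS`
WHERE TABLE_SCHEMA = 'dfs.tmp' AND TABLE_NAME = 'nation';
```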

# Metastore related options

- `metastore.enabled` - enables the Drill Metastore, so that table metadata can be stored during `ANALYZE TABLE` command
execution and read back during regular query execution or when querying some `INFORMATION_SCHEMA` tables.
- `metastore.metadata.store.depth_level` - specifies the maximum depth level for collecting metadata.
Possible values: `TABLE`, `SEGMENT`, `PARTITION`, `FILE`, `ROW_GROUP`, `ALL`.
- `metastore.metadata.use_schema` - enables usage of the schema stored in the Metastore.
- `metastore.metadata.use_statistics` - enables usage of the statistics stored in the Metastore at the planning stage.
- `metastore.metadata.fallback_to_file_metadata` - allows falling back to the file metadata cache when the required metadata is absent from the Metastore.
- `metastore.retrieval.retry_attempts` - specifies the number of retry attempts for query planning after detecting that the query metadata has changed.
If the number of retries is exceeded, the query is planned without metadata information from the Metastore.
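
These are regular Drill options, so they can be changed with `ALTER SYSTEM` or `ALTER SESSION`; for example (the level value shown is one of the documented possibilities):

```sql
ALTER SESSION SET `metastore.enabled` = true;
ALTER SESSION SET `metastore.metadata.store.depth_level` = 'ROW_GROUP';
```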

# Analyze operators description

The entry point for the `ANALYZE` command is the `MetastoreAnalyzeTableHandler` class. It creates a plan that includes
Metastore-specific operators for collecting metadata.

`MetastoreAnalyzeTableHandler` uses an `AnalyzeInfoProvider` to supply the information
required for building a suitable plan for collecting metadata.
Each group scan should provide a corresponding `AnalyzeInfoProvider` implementation class.

Analyze-specific operators:
- `MetadataAggBatch` - an operator that adds aggregate calls for all incoming table columns to calculate the required
metadata and produces the aggregations. If aggregation is performed on top of another aggregation,
the aggregate calls required for merging metadata are added instead.
- `MetadataHandlerBatch` - an operator responsible for handling metadata returned by incoming aggregate operators and
for fetching the required metadata from the Metastore to produce further aggregations.
- `MetadataControllerBatch` - responsible for converting the obtained metadata, fetching absent metadata from the Metastore,
and storing the resulting metadata into the Metastore.

`MetastoreAnalyzeTableHandler` forms a plan of the following shape, depending on the number of segments:

```
MetadataControllerBatch
...
MetadataHandlerBatch
MetadataAggBatch(dir0, ...)
MetadataHandlerBatch
MetadataAggBatch(dir0, dir1, ...)
MetadataHandlerBatch
MetadataAggBatch(dir0, dir1, fqn, ...)
Scan(DYNAMIC_STAR **, ANY fqn, ...)
```

The lowest `MetadataAggBatch` creates the required aggregate calls for every table column (or only for the interesting ones)
and produces aggregations grouped by the segment columns that correspond to a specific table level.
The `MetadataHandlerBatch` above it populates the batch with additional information, such as the metadata type.
The next `MetadataAggBatch` merges the metadata calculated below to obtain metadata for the parent metadata levels, and also stores the incoming data so that it can later be written to the Metastore.

`MetadataControllerBatch` obtains all the calculated metadata, converts it to a suitable form, and sends it to the Metastore.

For incremental analyze, `MetastoreAnalyzeTableHandler` creates a Scan over the updated files only
and provides `MetadataHandlerBatch` with information about the metadata that should be fetched from the Metastore, so existing up-to-date metadata is not recalculated.
@@ -46,8 +46,10 @@
]
},
{className: "Max", funcName: "max", types: [
{inputType: "Bit", outputType: "NullableBit", runningType: "Bit", major: "Numeric"},
{inputType: "Int", outputType: "NullableInt", runningType: "Int", major: "Numeric"},
{inputType: "BigInt", outputType: "NullableBigInt", runningType: "BigInt", major: "Numeric"},
{inputType: "NullableBit", outputType: "NullableBit", runningType: "Bit", major: "Numeric"},
{inputType: "NullableInt", outputType: "NullableInt", runningType: "Int", major: "Numeric"},
{inputType: "NullableBigInt", outputType: "NullableBigInt", runningType: "BigInt", major: "Numeric"},
{inputType: "Float4", outputType: "NullableFloat4", runningType: "Float4", major: "Numeric"},
@@ -698,46 +698,129 @@ Pair<SqlNodeList, SqlNodeList> ParenthesizedCompoundIdentifierList() :
</#if>

/**
* Parses a analyze statement.
* ANALYZE TABLE table_name {COMPUTE | ESTIMATE} | STATISTICS
* [(column1, column2, ...)] [ SAMPLE numeric PERCENT ]
* Parses analyze statements:
* <ul>
* <li>ANALYZE TABLE [table_name] [COLUMNS (col1, col2, ...)] REFRESH METADATA [partition LEVEL] [{COMPUTE | ESTIMATE} STATISTICS [(column1, column2, ...)]] [ SAMPLE numeric PERCENT ]
* <li>ANALYZE TABLE [table_name] DROP [METADATA|STATISTICS] [IF EXISTS]
* <li>ANALYZE TABLE [table_name] {COMPUTE | ESTIMATE} STATISTICS [(column1, column2, ...)] [ SAMPLE numeric PERCENT ]
* </ul>
*/
SqlNode SqlAnalyzeTable() :
{
SqlParserPos pos;
SqlIdentifier tblName;
SqlLiteral estimate = null;
SqlNodeList fieldList = null;
SqlNode level = null;
SqlLiteral estimate = null;
SqlLiteral dropMetadata = null;
SqlLiteral checkMetadataExistence = null;
SqlNumericLiteral percent = null;
}
{
<ANALYZE> { pos = getPos(); }
<TABLE>
tblName = CompoundIdentifier()
(
<COMPUTE> { estimate = SqlLiteral.createBoolean(false, pos); }
|
<ESTIMATE> { estimate = SqlLiteral.createBoolean(true, pos); }
)
<STATISTICS>
[
(fieldList = ParseRequiredFieldList("Table"))
]
[
<SAMPLE> percent = UnsignedNumericLiteral() <PERCENT>
{
BigDecimal rate = percent.bigDecimalValue();
if (rate.compareTo(BigDecimal.ZERO) <= 0 ||
rate.compareTo(BigDecimal.valueOf(100L)) > 0)
(
(
<COMPUTE> { estimate = SqlLiteral.createBoolean(false, pos); }
|
<ESTIMATE> {
if (true) {
throw new ParseException("ESTIMATE statistics collecting is not supported. See DRILL-7438.");
}
estimate = SqlLiteral.createBoolean(true, pos);
}
)
<STATISTICS>
[
(fieldList = ParseRequiredFieldList("Table"))
]
[
<SAMPLE> percent = UnsignedNumericLiteral() <PERCENT>
{
BigDecimal rate = percent.bigDecimalValue();
if (rate.compareTo(BigDecimal.ZERO) <= 0 ||
rate.compareTo(BigDecimal.valueOf(100L)) > 0)
{
throw new ParseException("Invalid percentage for ANALYZE TABLE");
}
}
]
{
throw new ParseException("Invalid percentage for ANALYZE TABLE");
if (percent == null) { percent = SqlLiteral.createExactNumeric("100.0", pos); }
return new SqlAnalyzeTable(pos, tblName, estimate, fieldList, percent);
}
}
)
|
(
[
<COLUMNS>
(
fieldList = ParseRequiredFieldList("Table")
|
<NONE> {fieldList = SqlNodeList.EMPTY;}
)
]
<REFRESH>
<METADATA>
[
level = StringLiteral()
<LEVEL>
]
[
(
<COMPUTE> { estimate = SqlLiteral.createBoolean(false, pos); }
|
<ESTIMATE> {
if (true) {
throw new ParseException("ESTIMATE statistics collecting is not supported. See DRILL-7438.");
}
estimate = SqlLiteral.createBoolean(true, pos);
}
)
<STATISTICS>
]
[
<SAMPLE> percent = UnsignedNumericLiteral() <PERCENT>
{
BigDecimal rate = percent.bigDecimalValue();
if (rate.compareTo(BigDecimal.ZERO) <= 0 ||
rate.compareTo(BigDecimal.valueOf(100L)) > 0) {
throw new ParseException("Invalid percentage for ANALYZE TABLE");
}
}
]
{
return new SqlMetastoreAnalyzeTable(pos, tblName, fieldList, level, estimate, percent);
}
)
|
(
<DROP>
[
<METADATA> { dropMetadata = SqlLiteral.createCharString("METADATA", pos); }
|
<STATISTICS> {
if (true) {
throw new ParseException("DROP STATISTICS is not supported.");
}
dropMetadata = SqlLiteral.createCharString("STATISTICS", pos);
}
]
[
<IF>
<EXISTS> { checkMetadataExistence = SqlLiteral.createBoolean(false, pos); }
]
{
if (checkMetadataExistence == null) {
checkMetadataExistence = SqlLiteral.createBoolean(true, pos);
}
return new SqlDropTableMetadata(pos, tblName, dropMetadata, checkMetadataExistence);
}
)
]
{
if (percent == null) { percent = SqlLiteral.createExactNumeric("100.0", pos); }
return new SqlAnalyzeTable(pos, tblName, estimate, fieldList, percent);
}
{ throw generateParseException(); }
}

