Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-7273: Introduce operators for handling metadata #1886

Closed
wants to merge 1 commit into from

Conversation

@vvysotskyi
Copy link
Member

vvysotskyi commented Oct 29, 2019

Jira: DRILL-7273

This pull request introduces commands and operators for collecting table metadata and storing it to the metastore.

Entry point for ANALYZE command is MetastoreAnalyzeTableHandler class. It creates plan which includes some metastore-specific operators for collecting metadata.

New operators are the following:
MetadataAggBatch - operator which adds aggregate calls for all incoming table columns to calculate required metadata and produces aggregations. If aggregation is performed on top of another aggregation, required aggregate calls for merging metadata will be added.

MetadataHandlerBatch - operator responsible for handling metadata returned by incoming aggregate operators and fetching required metadata form the metastore to produce further aggregations.

MetadataControllerBatch - responsible for converting obtained metadata, fetching absent metadata from the metastore and storing resulting metadata into the metastore.

MetastoreAnalyzeTableHandler has 2 classes which depending on the table type, provides the information required for building a suitable plan for collecting metadata: AnalyzeInfoProvider and MetadataInfoCollector.

MetastoreAnalyzeTableHandler based on segments count, forms plan in the following form:

MetadataControllerBatch
	...
		MetadataHandlerBatch
			MetadataAggBatch
				MetadataHandlerBatch
					MetadataAggBatch
						Scan

The lowest MetadataAggBatch creates required aggregate calls for every (or interesting only) table columns and produces aggregations with grouping by segment columns that correspond to specific table level.
MetadataHandlerBatch above it populates batch with additional information about metadata type and other info.
MetadataAggBatch above merges metadata calculated before to obtain metadata for parent metadata levels and also stores incoming data to populate it to the metastore later.

MetadataControllerBatch obtains all calculated metadata, converts it to the suitable form and sends it to the metastore.

For the case of incremental analyze, MetastoreAnalyzeTableHandler creates Scan with updated files only and provides MetadataHandlerBatch with information about metadata which should be fetched from the metastore, so existing actual metadata wouldn't be recalculated.

.forEach(arguments::add);

// collectedMap field value
arguments.add(null);

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Oct 31, 2019

Member

Null adding is a little bit confusing...

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Agree. Replaced with an empty array.


@SuppressWarnings("unchecked")
private <T extends BaseMetadata & LocationProvider> VectorContainer writeMetadataUsingBatchSchema(List<T> segments) {
Preconditions.checkState(segments.size() > 0, "Unable to fetch segments.");

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Oct 31, 2019

Member

If segments size is more than 0, it means we could not fetch segments?

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

It's an obsolete message, replaced it with more suitable.


allMetaToHandle.addAll(segmentsToUpdate);

// is handled separately since it is not overridden when writing the metadata

This comment has been minimized.

Copy link
@arina-ielchiieva

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

It was meant about removed top-level segments, thanks, updated comment.

.collect(Collectors.toList());
metadataToRemove.addAll(removedTopSegments);
} else {
// table metadata may still actual

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Oct 31, 2019

Member
Suggested change
// table metadata may still actual
// table metadata may be still actual
Copy link
Member

arina-ielchiieva left a comment

Some initial comments

@vvysotskyi vvysotskyi force-pushed the vvysotskyi:DRILL-7273 branch 2 times, most recently from bc2cf5d to 9378ff2 Nov 11, 2019
Copy link
Member Author

vvysotskyi left a comment

@arina-ielchiieva, thanks a lot for the review and sorry for so many changes.
I have addressed CR comments in two new commits. Could you please review the new changes?

@@ -246,6 +257,7 @@ public String toString() {
builder.append(", numFiles=").append(getEntries().size());
builder.append(", numRowGroups=").append(getRowGroupsMetadata().size());
builder.append(", usedMetadataFile=").append(usedMetadataCache);
builder.append(", usedMetastore=").append(usedMetastore);

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Yes, it would be useful to point out that metastore was used. Thanks, added it to MetadataDirectGroupScan.

}

if (metadataType == MetadataType.ROW_GROUP) {
arguments.add(Long.toString(((RowGroupMetadata) metadata).getRowGroupIndex()));

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Thanks, replaced. But left in places like the line below since the wrong overloaded method was chosen and code failed with CCE.

log("Drill Plan", plan, logger);
return plan;
} finally {
context.getOptions().setLocalOption(ExecConstants.METASTORE_ENABLED, true);

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

No, it is not required, thanks, removed it.

.forEach(arguments::add);

// collectedMap field value
arguments.add(null);

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Agree. Replaced with an empty array.


@SuppressWarnings("unchecked")
private <T extends BaseMetadata & LocationProvider> VectorContainer writeMetadataUsingBatchSchema(List<T> segments) {
Preconditions.checkState(segments.size() > 0, "Unable to fetch segments.");

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

It's an obsolete message, replaced it with more suitable.

@@ -489,6 +489,21 @@ private ExecConstants() {
public static final String IMPLICIT_FILEPATH_COLUMN_LABEL = "drill.exec.storage.implicit.filepath.column.label";
public static final OptionValidator IMPLICIT_FILEPATH_COLUMN_LABEL_VALIDATOR = new StringValidator(IMPLICIT_FILEPATH_COLUMN_LABEL,
new OptionDescription("Available as of Drill 1.10. Sets the implicit column name for the filepath column."));
public static final String IMPLICIT_ROW_GROUP_INDEX_COLUMN_LABEL = "drill.exec.storage.implicit.row_group_index.column.label";
public static final OptionValidator IMPLICIT_ROW_GROUP_INDEX_COLUMN_LABEL_VALIDATOR = new StringValidator(IMPLICIT_ROW_GROUP_INDEX_COLUMN_LABEL,
new OptionDescription("Available as of Drill 1.17. Sets the implicit column name for the row group index (rgi) column."));

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Completed a description with the information about this option.

/**
* Enum of table types.
*/
public enum TableType {

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Thanks for pointing this, agree that it should be pluggable, added AnalyzeInfoProviderRegistry to pick up dynamically all implementations of AnalyzeInfoProvider.

try {
appendStatistics(statisticsCollector);
} catch (IOException e) {
throw new RuntimeException(e);

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Setting failed executor state. Thanks for pointing this.

.workspace(schemaName)
.name(tableName)
.build();
MetastoreTableInfo metastoreTableInfo = metastoreRegistry.get().tables().basicRequests().metastoreTableInfo(tableInfo);

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Agree, added logic to continue query execution.

MetadataIdentifierUtils.getValuesFromMetadataIdentifier(metadata.getMetadataInfo().identifier()),
popConfig.getMetadataHandlerContext().segmentColumns().size()));

// adds column statistics values assuming that they are sorted in alphabetic order

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

getResultSetLoaderForMetadata() method returns statistics values already sorted in alphabetic order regarding their statistics kinds.

@vvysotskyi vvysotskyi force-pushed the vvysotskyi:DRILL-7273 branch 2 times, most recently from a297db3 to d3ee488 Nov 11, 2019
`MetastoreAnalyzeTableHandler` uses `AnalyzeInfoProvider` for providing the information
required for building a suitable plan for collecting metadata.
Every group scan should provide each `AnalyzeInfoProvider` implementation and annotate implementation class with

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

... should provide corresponding AnalyzeInfoProvider implementation class.

`AnalyzeInfoProviderTemplate` annotation to load it dynamically when executing analyze.

Analyze specific operators:

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

Analyze command specific operators:

`MetadataAggBatch` above merges metadata calculated before to obtain metadata for parent metadata levels and also stores incoming data to populate it to the metastore later.

`MetadataControllerBatch` obtains all calculated metadata, converts it to the suitable form and sends it to the metastore.

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

metastore -> Metastore


For the case of incremental analyze, `MetastoreAnalyzeTableHandler` creates Scan with updated files only
and provides `MetadataHandlerBatch` with information about metadata which should be fetched from the metastore, so existing actual metadata wouldn't be recalculated.

This comment has been minimized.

Copy link
@arina-ielchiieva
import org.apache.drill.common.exceptions.DrillRuntimeException;

/**
* Exception which indicates that table metadata is outdated.

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

Please update java doc

}

if (!(instance instanceof AnalyzeInfoProvider)) {
logger.error("Created instance of [{}] does not implement [{}] interface",

This comment has been minimized.

Copy link
@arina-ielchiieva
import java.lang.annotation.Target;

/**
* Annotation used to mark all {@link AnalyzeInfoProvider} implementations which

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

Marks all ...

*/
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE})
public @interface AnalyzeInfoProviderTemplate {

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

Why we call it template? I think AnalyzeInfoProvider is more suitable.


private List<AnalyzeInfoProvider> init() {
List<AnalyzeInfoProvider> providers;
List<AnnotatedClassDescriptor> annotatedClasses = classpathScan.getAnnotatedClasses(AnalyzeInfoProviderTemplate.class.getName());

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

Why we annotate classes if we can simply look for classes which implement AnalyzeInfoProvider interface thus annotation can be removed.

@@ -131,7 +130,9 @@ private static PhysicalPlan convertPlan(QueryContext context, String sql, Pointe
"Will sync remote and local function registries if needed and retry " +
"in case if issue was due to missing function implementation.", e);
if (context.getFunctionRegistry().syncWithRemoteRegistry(
context.getDrillOperatorTable().getFunctionRegistryVersion())) {
context.getDrillOperatorTable().getFunctionRegistryVersion())

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

Please explain? What about CTAS for example?

This comment has been minimized.

Copy link
@vvysotskyi

vvysotskyi Nov 11, 2019

Author Member

Thanks, CTAS with EXPLAIN also should be present in this check. The goal was to avoid rerunning queries that cannot have UDFs but failed with an exception. For example, for Metastore ANALYZE, Metastore is being disabled using a query-level option before plan constructing but after the check that it was enabled. So after rerunning such a query, it will fail with another exception that Metastore is disabled.

@@ -109,7 +108,8 @@ public PhysicalPlan getPlan(SqlNode sqlNode)
config.getConverter().getDefaultSchema(), sqlAnalyzeTable.getSchemaPath());
DrillTable table = getDrillTable(drillSchema, sqlAnalyzeTable.getName());

AnalyzeInfoProvider analyzeInfoProvider = AnalyzeInfoProvider.getAnalyzeInfoProvider(TableType.getTableType(table.getGroupScan()));
AnalyzeInfoProvider analyzeInfoProvider = context.getAnalyzeInfoProviderRegistry()

This comment has been minimized.

Copy link
@arina-ielchiieva

arina-ielchiieva Nov 11, 2019

Member

I think GroupScan shoulder return analyzeInfoProvider instance.

@vvysotskyi vvysotskyi force-pushed the vvysotskyi:DRILL-7273 branch from 3153db2 to ecc6e07 Nov 11, 2019
@arina-ielchiieva

This comment has been minimized.

Copy link
Member

arina-ielchiieva commented Nov 12, 2019

+1, please squash the commits.

@vvysotskyi vvysotskyi force-pushed the vvysotskyi:DRILL-7273 branch from ecc6e07 to 9eb4948 Nov 12, 2019
@asfgit asfgit closed this in 7ab4c37 Nov 14, 2019
xiangt920 added a commit to xiangt920/drill that referenced this pull request Dec 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.