Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-7095: Expose table schema (TupleMetadata) to physical operator (EasySubScan) #1696

Closed
wants to merge 2 commits into from

Conversation

arina-ielchiieva
Copy link
Member

@arina-ielchiieva arina-ielchiieva commented Mar 14, 2019

  1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default.
  2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema.
  3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties.
  4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan.
  5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true.

This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.

Jira: DRILL-7095.

@arina-ielchiieva
Copy link
Member Author

@paul-rogers please review.

Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. Major concern is that the schema should be a property of the group scan and sub scan, not of the storage plugin config.

return table;
}

private void setSchema(DrillTable table, String tableName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the DrillTable know its own name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FsMetastoreSchemaProvider schemaProvider = new FsMetastoreSchemaProvider(this, tableName);
table.setSchema(schemaProvider.read().getSchema());
} catch (IOException e) {
logger.debug("Unable to deserialize schema from schema file for table: " + tableName, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, this might handle other schema sources than just the file. This is a perfectly fine implementation for getting started; just something to remember for later.

}

public EasyGroupScan(String userName, FileSelection selection, EasyFormatPlugin<?> formatPlugin,
List<SchemaPath> columns, Path selectionRoot) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed now, but the arg list is getting long enough that I suspect we'll want a "builder" abstraction at some point.

) throws IOException{
super(userName);
this.selection = Preconditions.checkNotNull(selection);
this.formatPlugin = Preconditions.checkNotNull(formatPlugin, "Unable to load format plugin for provided format config.");
if (schema != null) {
this.formatPlugin.setSchema(schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this makes sense. The format plugin is shared across multiple scans. I don't believe it is copied anew for each scan. So, I think the schema should be a property of the group scan, not the plugin.

Also, the plugin is not serialized; it is created anew (IIRC) on each node.

Note that, once the schema is an attribute of the group scan, it needs to be set in the copy constructor. Since we don't expect the schema to change, we can just have the copy reuse the same schema as the original: no need to copy the schema itself. Should add a comment to explain this.


@JsonProperty
public TupleMetadata getSchema() {
return formatPlugin.getSchema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, schema should be a property of this class. Once it is, the pointer to the schema must be copied to the sub-scan class in the getSpecificScan() method.

It is the sub-scan that will be serialized out to the physical plan, then deserialized on each worker node.

super(userName);
this.formatPlugin = (EasyFormatPlugin<?>) engineRegistry.getFormatPlugin(storageConfig, formatConfig);
Preconditions.checkNotNull(this.formatPlugin);
this.formatPlugin.setSchema(schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, can't store this in the plugin: the plugin is (or should be) shared.

Or, are we serializing the plugin (config) to handle table properties? If so, then the schema can be set via the Drill web console, which I don't think we want as it will encourage people to create plugin configs for table schema, but that will require that each file have a distinct file suffix, which will cause confusion. (Already does for CSV/CSVH).

@@ -219,6 +220,8 @@ protected ColumnsScanFramework buildFramework(
}
}

private TupleMetadata schema;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See earlier notes.

@Override
public void setSchema(TupleMetadata schema) {
this.schema = schema;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be best if the schema can be immutable: set in the constructor and not changeable once set. Else, it is very unclear what the semantics will be.

Same is true, by the way, for the scan nodes.

@@ -683,5 +683,6 @@ drill.exec.options: {
exec.query.return_result_set_for_ddl: true,
# ========= rm related options ===========
exec.rm.queryTags: "",
exec.rm.queues.wait_for_preferred_nodes: true
exec.rm.queues.wait_for_preferred_nodes: true,
store.table.use_schema_file: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this closer to other store-related options?

@arina-ielchiieva
Copy link
Member Author

@paul-rogers you are right, I should not have modified plugin. I have updated the code and addressed other code review comments, please review.

…(EasySubScan)

1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default.
2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema.
3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties.
4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan.
5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true.

This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.
Copy link
Contributor

@paul-rogers paul-rogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good.
+1

@asfgit asfgit closed this in df00912 Mar 19, 2019
gparai pushed a commit to gparai/drill that referenced this pull request Mar 19, 2019
…(EasySubScan)

1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default.
2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema.
3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties.
4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan.
5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true.

This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.

closes apache#1696
lushuifeng pushed a commit to lushuifeng/drill that referenced this pull request Jun 21, 2019
…(EasySubScan)

1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default.
2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema.
3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties.
4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan.
5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true.

This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.

closes apache#1696
xiangt920 pushed a commit to xiangt920/drill that referenced this pull request Dec 26, 2019
…(EasySubScan)

1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default.
2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema.
3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties.
4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan.
5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true.

This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.

closes apache#1696
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants