New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-7095: Expose table schema (TupleMetadata) to physical operator (EasySubScan) #1696
Conversation
@paul-rogers please review. |
4234440
to
c8b86b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. Major concern is that the schema should be a property of the group scan and sub scan, not of the storage plugin config.
return table; | ||
} | ||
|
||
private void setSchema(DrillTable table, String tableName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the DrillTable know its own name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might seem weird but it does not, only contains selection of files.
FsMetastoreSchemaProvider schemaProvider = new FsMetastoreSchemaProvider(this, tableName); | ||
table.setSchema(schemaProvider.read().getSchema()); | ||
} catch (IOException e) { | ||
logger.debug("Unable to deserialize schema from schema file for table: " + tableName, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, this might handle other schema sources than just the file. This is a perfectly fine implementation for getting started; just something to remember for later.
} | ||
|
||
public EasyGroupScan(String userName, FileSelection selection, EasyFormatPlugin<?> formatPlugin, | ||
List<SchemaPath> columns, Path selectionRoot) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed now, but the arg list is getting long enough that I suspect we'll want a "builder" abstraction at some point.
) throws IOException{ | ||
super(userName); | ||
this.selection = Preconditions.checkNotNull(selection); | ||
this.formatPlugin = Preconditions.checkNotNull(formatPlugin, "Unable to load format plugin for provided format config."); | ||
if (schema != null) { | ||
this.formatPlugin.setSchema(schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this makes sense. The format plugin is shared across multiple scans. I don't believe it is copied anew for each scan. So, I think the schema should be a property of the group scan, not the plugin.
Also, the plugin is not serialized; it is created anew (IIRC) on each node.
Note that, once the schema is an attribute of the group scan, it needs to be set in the copy constructor. Since we don't expect the schema to change, we can just have the copy reuse the same schema as the original: no need to copy the schema itself. Should add a comment to explain this.
|
||
@JsonProperty | ||
public TupleMetadata getSchema() { | ||
return formatPlugin.getSchema(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, schema should be a property of this class. Once it is, the pointer to the schema must be copied to the sub-scan class in the getSpecificScan() method.
It is the sub-scan that will be serialized out to the physical plan, then deserialized on each worker node.
super(userName); | ||
this.formatPlugin = (EasyFormatPlugin<?>) engineRegistry.getFormatPlugin(storageConfig, formatConfig); | ||
Preconditions.checkNotNull(this.formatPlugin); | ||
this.formatPlugin.setSchema(schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, can't store this in the plugin: the plugin is (or should be) shared.
Or, are we serializing the plugin (config) to handle table properties? If so, then the schema can be set via the Drill web console, which I don't think we want as it will encourage people to create plugin configs for table schema, but that will require that each file have a distinct file suffix, which will cause confusion. (Already does for CSV/CSVH).
@@ -219,6 +220,8 @@ protected ColumnsScanFramework buildFramework( | |||
} | |||
} | |||
|
|||
private TupleMetadata schema; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See earlier notes.
@Override | ||
public void setSchema(TupleMetadata schema) { | ||
this.schema = schema; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be best if the schema can be immutable: set in the constructor and not changeable once set. Else, it is very unclear what the semantics will be.
Same is true, by the way, for the scan nodes.
@@ -683,5 +683,6 @@ drill.exec.options: { | |||
exec.query.return_result_set_for_ddl: true, | |||
# ========= rm related options =========== | |||
exec.rm.queryTags: "", | |||
exec.rm.queues.wait_for_preferred_nodes: true | |||
exec.rm.queues.wait_for_preferred_nodes: true, | |||
store.table.use_schema_file: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this closer to other store-related options?
c8b86b1
to
8836c62
Compare
@paul-rogers you are right, I should not have modified plugin. I have updated the code and addressed other code review comments, please review. |
…(EasySubScan) 1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default. 2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema. 3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties. 4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan. 5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true. This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.
8836c62
to
812cc3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good.
+1
…(EasySubScan) 1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default. 2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema. 3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties. 4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan. 5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true. This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader. closes apache#1696
…(EasySubScan) 1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default. 2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema. 3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties. 4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan. 5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true. This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader. closes apache#1696
…(EasySubScan) 1. Add system / session option store.table.use_schema_file to control if file schema can be used during query execution. False by default. 2. Added methods in StoragePlugin interface which allow to create Group Scan with provided table schema. 3. EasyGroupScan and EasySubScan now contain table schema, also they are able to serialize / deserialize it along with other scan properties. 4. DrillTable which is the main entry point for schema provisioning, has method to store schema and later uses it to create physical scan. 5. WorkspaceSchema when returning Drill table instance will get table schema from table root if available and if store.table.use_schema_file is set to true. This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader. closes apache#1696
store.table.use_schema_file
to control if file schema can be used during query execution. False by default.store.table.use_schema_file
is set to true.This PR is the next step for Schema Provisioning project which currently exposes schema only for text reader.
Jira: DRILL-7095.