DRILL-7095: Expose table schema (TupleMetadata) to physical operator (EasySubScan) #1696

arina-ielchiieva · 2019-03-14T20:33:03Z

1. Added system / session option store.table.use_schema_file to control whether the file schema can be used during query execution. False by default.
2. Added methods to the StoragePlugin interface that allow creating a Group Scan with a provided table schema.
3. EasyGroupScan and EasySubScan now contain the table schema and can serialize / deserialize it along with other scan properties.
4. DrillTable, which is the main entry point for schema provisioning, has a method to store the schema and later uses it to create the physical scan.
5. WorkspaceSchema, when returning a Drill table instance, gets the table schema from the table root if it is available and store.table.use_schema_file is set to true.

This PR is the next step for the Schema Provisioning project; it currently exposes the schema only for the text reader.

paul-rogers

Mostly looks good. Major concern is that the schema should be a property of the group scan and sub scan, not of the storage plugin config.

paul-rogers · 2019-03-16T19:47:28Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java

      return table;
    }

+    private void setSchema(DrillTable table, String tableName) {


Doesn't the DrillTable know its own name?

It might seem weird, but it does not; it only contains the selection of files.

https://github.com/apache/drill/blob/469be17597e7b7c6bc1de9863dcb6c5604a55f0c/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillTable.java

paul-rogers · 2019-03-16T19:48:44Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/WorkspaceSchemaFactory.java

+          FsMetastoreSchemaProvider schemaProvider = new FsMetastoreSchemaProvider(this, tableName);
+          table.setSchema(schemaProvider.read().getSchema());
+        } catch (IOException e) {
+          logger.debug("Unable to deserialize schema from schema file for table: " + tableName, e);


In the future, this might handle other schema sources than just the file. This is a perfectly fine implementation for getting started; just something to remember for later.

paul-rogers · 2019-03-16T19:50:38Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyGroupScan.java

+  }
+
+  public EasyGroupScan(String userName, FileSelection selection, EasyFormatPlugin<?> formatPlugin,
+                       List<SchemaPath> columns, Path selectionRoot) throws IOException {


Not needed now, but the arg list is getting long enough that I suspect we'll want a "builder" abstraction at some point.
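
For reference, a minimal sketch of what such a builder could look like. `ScanArgsSketch` is a hypothetical stand-in, not the real EasyGroupScan, and the Drill types (FileSelection, EasyFormatPlugin, SchemaPath, Path) are replaced with String so the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for EasyGroupScan; Drill types are simplified to String.
final class ScanArgsSketch {
  final String userName;
  final String selection;
  final String formatPlugin;
  final List<String> columns;
  final String selectionRoot;

  private ScanArgsSketch(Builder b) {
    this.userName = b.userName;
    // Mirrors the null checks on selection and formatPlugin in the constructor under review.
    this.selection = Objects.requireNonNull(b.selection, "selection");
    this.formatPlugin = Objects.requireNonNull(b.formatPlugin, "formatPlugin");
    // Defensive copy so later changes to the caller's list do not leak in.
    this.columns = Collections.unmodifiableList(new ArrayList<>(b.columns));
    this.selectionRoot = b.selectionRoot;
  }

  static Builder builder() {
    return new Builder();
  }

  static final class Builder {
    private String userName;
    private String selection;
    private String formatPlugin;
    private List<String> columns = new ArrayList<>();
    private String selectionRoot;

    Builder userName(String userName) { this.userName = userName; return this; }
    Builder selection(String selection) { this.selection = selection; return this; }
    Builder formatPlugin(String formatPlugin) { this.formatPlugin = formatPlugin; return this; }
    Builder columns(List<String> columns) {
      this.columns = Objects.requireNonNull(columns, "columns");
      return this;
    }
    Builder selectionRoot(String selectionRoot) { this.selectionRoot = selectionRoot; return this; }

    ScanArgsSketch build() { return new ScanArgsSketch(this); }
  }
}
```

Callers would then name each argument, for example `ScanArgsSketch.builder().userName("alice").selection("/data/t").formatPlugin("text").build()`, so adding one more property such as the schema does not require yet another constructor overload.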

paul-rogers · 2019-03-16T19:55:30Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyGroupScan.java

      ) throws IOException{
    super(userName);
    this.selection = Preconditions.checkNotNull(selection);
    this.formatPlugin = Preconditions.checkNotNull(formatPlugin, "Unable to load format plugin for provided format config.");
+    if (schema != null) {
+      this.formatPlugin.setSchema(schema);


Not sure this makes sense. The format plugin is shared across multiple scans. I don't believe it is copied anew for each scan. So, I think the schema should be a property of the group scan, not the plugin.

Also, the plugin is not serialized; it is created anew (IIRC) on each node.

Note that, once the schema is an attribute of the group scan, it needs to be set in the copy constructor. Since we don't expect the schema to change, we can just have the copy reuse the same schema as the original: no need to copy the schema itself. Should add a comment to explain this.
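
A rough sketch of the shape suggested here, using hypothetical stand-in classes rather than the real Drill ones (`SchemaSketch` for TupleMetadata, `SharedPluginSketch` for the format plugin, `GroupScanSketch` for EasyGroupScan). It assumes only what is stated above: the plugin is shared, and the schema does not change after the scan is created.

```java
// Hypothetical stand-in for TupleMetadata.
final class SchemaSketch {
  final String definition;

  SchemaSketch(String definition) {
    this.definition = definition;
  }
}

// Hypothetical stand-in for a format plugin shared across scans: no per-table state.
final class SharedPluginSketch {
  final String name;

  SharedPluginSketch(String name) {
    this.name = name;
  }
}

// Hypothetical stand-in for EasyGroupScan: the schema lives on the scan itself.
final class GroupScanSketch {
  private final SharedPluginSketch formatPlugin;
  private final SchemaSketch schema; // may be null when no schema was provided

  GroupScanSketch(SharedPluginSketch formatPlugin, SchemaSketch schema) {
    this.formatPlugin = formatPlugin;
    this.schema = schema;
  }

  // Copy constructor. The schema is not expected to change, so the copy
  // reuses the same schema instance instead of cloning it.
  GroupScanSketch(GroupScanSketch that) {
    this.formatPlugin = that.formatPlugin;
    this.schema = that.schema;
  }

  SharedPluginSketch getFormatPlugin() {
    return formatPlugin;
  }

  SchemaSketch getSchema() {
    return schema;
  }
}
```

With this layout two scans over different tables can share one plugin instance while each keeps its own schema.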

paul-rogers · 2019-03-16T19:57:59Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyGroupScan.java

+
+  @JsonProperty
+  public TupleMetadata getSchema() {
+    return formatPlugin.getSchema();


Again, schema should be a property of this class. Once it is, the pointer to the schema must be copied to the sub-scan class in the getSpecificScan() method.

It is the sub-scan that will be serialized out to the physical plan, then deserialized on each worker node.
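
The hand-off described above might look roughly like this. The classes are hypothetical stand-ins (`SchemaHandle`, `SubScanSketch`, `HandoffGroupScanSketch`), and serialization is left out entirely; the sketch only shows the group scan passing its schema reference to each sub-scan it creates.

```java
// Hypothetical stand-in for TupleMetadata.
final class SchemaHandle {
  final String definition;

  SchemaHandle(String definition) {
    this.definition = definition;
  }
}

// Hypothetical stand-in for EasySubScan: the unit that would be written
// to the physical plan and read back on each worker node.
final class SubScanSketch {
  final int minorFragmentId;
  final SchemaHandle schema;

  SubScanSketch(int minorFragmentId, SchemaHandle schema) {
    this.minorFragmentId = minorFragmentId;
    this.schema = schema;
  }
}

// Hypothetical stand-in for EasyGroupScan.
final class HandoffGroupScanSketch {
  private final SchemaHandle schema;

  HandoffGroupScanSketch(SchemaHandle schema) {
    this.schema = schema;
  }

  // Every sub-scan receives the same schema reference held by the group scan.
  SubScanSketch getSpecificScan(int minorFragmentId) {
    return new SubScanSketch(minorFragmentId, schema);
  }
}
```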

paul-rogers · 2019-03-16T20:00:21Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasySubScan.java

    super(userName);
    this.formatPlugin = (EasyFormatPlugin<?>) engineRegistry.getFormatPlugin(storageConfig, formatConfig);
    Preconditions.checkNotNull(this.formatPlugin);
+    this.formatPlugin.setSchema(schema);


Again, can't store this in the plugin: the plugin is (or should be) shared.

Or, are we serializing the plugin (config) to handle table properties? If so, the schema could be set via the Drill web console, which I don't think we want: it would encourage people to create plugin configs for table schemas, and that requires each file to have a distinct file suffix, which causes confusion. (It already does for CSV/CSVH.)

paul-rogers · 2019-03-16T20:00:37Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java

@@ -219,6 +220,8 @@ protected ColumnsScanFramework buildFramework(
    }
  }

+  private TupleMetadata schema;


See earlier notes.

paul-rogers · 2019-03-16T20:01:53Z

exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/TextFormatPlugin.java

+  @Override
+  public void setSchema(TupleMetadata schema) {
+    this.schema = schema;
+  }


Would be best if the schema can be immutable: set in the constructor and not changeable once set. Else, it is very unclear what the semantics will be.

Same is true, by the way, for the scan nodes.
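
To make the concern concrete, here is a small sketch with hypothetical classes (none of them are Drill's). It contrasts a shared object with a schema setter against a scan whose schema is fixed in the constructor. With the setter, the second scan silently replaces the schema that the first scan sees.

```java
// Hypothetical stand-in for a shared format plugin that exposes a schema setter.
final class MutablePluginSketch {
  private String schema;

  void setSchema(String schema) {
    this.schema = schema;
  }

  String getSchema() {
    return schema;
  }
}

// Hypothetical scan that stores its schema in the shared plugin.
final class PluginBackedScanSketch {
  private final MutablePluginSketch plugin;

  PluginBackedScanSketch(MutablePluginSketch plugin, String schema) {
    this.plugin = plugin;
    plugin.setSchema(schema); // overwrites whatever an earlier scan stored
  }

  String getSchema() {
    return plugin.getSchema();
  }
}

// Alternative: the schema is set once in the constructor and there is no setter.
final class ImmutableScanSketch {
  private final String schema;

  ImmutableScanSketch(String schema) {
    this.schema = schema;
  }

  String getSchema() {
    return schema;
  }
}
```

In the first variant, creating a scan for a second table changes what the first scan returns from `getSchema()`; in the second variant each scan keeps the value it was constructed with.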

paul-rogers · 2019-03-16T20:02:20Z

exec/java-exec/src/main/resources/drill-module.conf

@@ -683,5 +683,6 @@ drill.exec.options: {
    exec.query.return_result_set_for_ddl: true,
    # ========= rm related options ===========
    exec.rm.queryTags: "",
-    exec.rm.queues.wait_for_preferred_nodes: true
+    exec.rm.queues.wait_for_preferred_nodes: true,
+    store.table.use_schema_file: false


Move this closer to other store-related options?

arina-ielchiieva · 2019-03-18T14:13:13Z

@paul-rogers you are right, I should not have modified the plugin. I have updated the code and addressed the other code review comments; please review.

paul-rogers

Looks very good.
+1

arina-ielchiieva force-pushed the DRILL-7095 branch from 4234440 to c8b86b1 Compare March 15, 2019 12:49

paul-rogers reviewed Mar 16, 2019

arina-ielchiieva force-pushed the DRILL-7095 branch from c8b86b1 to 8836c62 Compare March 18, 2019 14:09

arina-ielchiieva added 2 commits March 18, 2019 16:34

DRILL-7095: Changes after code review

812cc3b

arina-ielchiieva force-pushed the DRILL-7095 branch from 8836c62 to 812cc3b Compare March 18, 2019 14:35

paul-rogers approved these changes Mar 19, 2019

asfgit closed this in df00912 Mar 19, 2019
