
DRILL-5089: Dynamically load schema of storage plugin only when needed for every query #1032

Closed
wants to merge 1 commit

Conversation

chunhui-shi
Contributor

DRILL-5089: Dynamically load schema of storage plugin only when needed for every query

For each query, loading all storage plugins and all workspaces under file system plugins is not needed.

This patch uses DynamicRootSchema as the root schema for Drill, which loads the corresponding storage plugin only when needed.

INFORMATION_SCHEMA reads the full schema information and loads second-level schemas accordingly.

For workspaces under the same FileSystem, there is no need to create a FileSystem per workspace.

Use the fs.access API to check permissions; it is available as of HDFS 2.6, except for the Windows + local file system case.
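For context, here is a minimal sketch of the access-based permission check that last point describes, using Hadoop's FileSystem.access (available since Hadoop 2.6). The accessible helper and the chosen FsAction are illustrative assumptions, not Drill's exact code:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.AccessControlException;

class WorkspaceAccessCheck {
  /** Returns true if the current user may list the workspace root.
   *  fs.access() asks the file system to check permissions directly,
   *  which is cheaper than fetching statuses via listStatus(). */
  static boolean accessible(FileSystem fs, Path wsRoot) throws IOException {
    try {
      fs.access(wsRoot, FsAction.READ_EXECUTE);
      return true;
    } catch (AccessControlException e) {
      return false;   // workspace stays hidden from this user
    }
  }
}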

Contributor

@paul-rogers paul-rogers left a comment


Did a superficial code-level review. @arina-ielchiieva, please look at the Calcite aspects.

}

retSchema = getSubSchemaMap().get(schemaName);
return retSchema;
Contributor

If the original call returns non-null, we make the same call a second time. Better:

retSchema = ...
if (retSchema != null) { return retSchema; }
loadSchemaFactory(...)
return getSubSchemaMap()...
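A sketch of that suggestion (the method shape is simplified; the real method in this patch has a different signature and return type):

Schema getSubSchema(String schemaName) {
  // Try the cached sub-schema map first.
  Schema retSchema = getSubSchemaMap().get(schemaName);
  if (retSchema != null) {
    return retSchema;                 // found: no second lookup needed
  }
  loadSchemaFactory(schemaName);      // register only this plugin's schema
  return getSubSchemaMap().get(schemaName);
}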


@Override
public NavigableSet<String> getTableNames() {
Set<String> pluginNames = Sets.newHashSet();
Contributor

Should this be case-insensitive?

Contributor Author

Plugin names in Drill are case-sensitive.

StoragePlugin plugin = getSchemaFactories().getPlugin(schemaName);
if (plugin != null) {
plugin.registerSchemas(schemaConfig, thisPlus);
}
Contributor

If the name is dfs.test, we first look up the compound name, then the parts? Why? Do we put the compound names in the map? Or can we have one schema literally named "dfs.test" and another that is the test schema nested under dfs? Or can this code be restructured a bit?

}
else {
//this schema name could be `dfs.tmp`, a 2nd level schema under 'dfs'
String[] paths = schemaName.split("\\.");
Contributor

Should this be done here in this simple way? How many other places do we do the same thing? Or should we have a common function to split schema names, so we can handle, say, escapes and other special cases that might come along?
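A sketch of the kind of shared helper being suggested here (the class and method names are hypothetical, not an existing Drill utility):

import java.util.Arrays;
import java.util.List;

public final class SchemaNameUtils {
  private SchemaNameUtils() {}

  /** Splits a compound schema name such as "dfs.tmp" into its parts.
   *  Today this is the same simple split the diff does inline; a shared
   *  helper gives escapes and other special cases one place to live. */
  public static List<String> splitSchemaPath(String schemaName) {
    return Arrays.asList(schemaName.split("\\."));
  }
}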

else {
//this schema name could be `dfs.tmp`, a 2nd level schema under 'dfs'
String[] paths = schemaName.split("\\.");
if (paths.length == 2) {
Contributor

Do we support only 1- and 2-part names? Should we assert that the length <= 2?

* In this case, we will still use method listStatus.
* In other cases, we use access method since it is cheaper.
*/
if (SystemUtils.IS_OS_WINDOWS && fs.getUri().getScheme().equalsIgnoreCase("file")) {
Contributor

HDFS probably defines a constant for "file". Should we reference that?

Contributor Author

FileSystem in HDFS has a constant DEFAULT_FS ("file:///"); for now I will define our own.
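A later revision of this hunk (quoted further below) references FileSystemSchemaFactory.LOCAL_FS_SCHEME. A sketch of that constant and the check; the helper method name and the constant's exact value are assumptions consistent with the equalsIgnoreCase("file") test above:

import org.apache.commons.lang3.SystemUtils;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemaFactory {
  public static final String LOCAL_FS_SCHEME = "file";   // assumed value

  /** fs.access() is not reliable for the local file system on Windows,
   *  so that combination falls back to the costlier listStatus() check. */
  static boolean needsListStatusFallback(FileSystem fs) {
    return SystemUtils.IS_OS_WINDOWS
        && fs.getUri().getScheme().equalsIgnoreCase(LOCAL_FS_SCHEME);
  }
}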

@@ -175,6 +193,21 @@ public WorkspaceSchema createSchema(List<String> parentSchemaPath, SchemaConfig
return new WorkspaceSchema(parentSchemaPath, schemaName, schemaConfig);
}

public WorkspaceSchema createSchema(List<String> parentSchemaPath, SchemaConfig schemaConfig, DrillFileSystem fs) throws IOException {
if (!accessible(fs)) {
Contributor

Is returning null sufficient to tell the user that they don't have permission to do this operation?

Contributor Author

Returning null means the user cannot even list this workspace, so they do not know that it exists at all. I think that is good access-control practice.

If users expect to see a workspace but cannot, they need to figure out why by themselves.

if (this.fs == null) {
this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
}
return this.fs;
Contributor

No need for this.fs, just fs will do.

if (this.fs == null) {
this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
}
return this.fs;
Contributor

This class caches the file system, which is good. The other classes in this PR do not; they create the fs as needed.

Does Calcite allow some kind of session state in which we can cache the fs for the query (plan) rather than creating it on the fly each time we need it?
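One hypothetical way to get per-query caching, independent of whatever hook Calcite may offer, is to hang a small cache off an object whose lifetime matches the plan. Everything below (the class name, where it would live) is an assumption for illustration only:

import java.util.HashMap;
import java.util.Map;
import org.apache.drill.exec.store.dfs.DrillFileSystem;
import org.apache.drill.exec.util.ImpersonationUtil;
import org.apache.hadoop.conf.Configuration;

public class QueryScopedFileSystemCache {
  // One DrillFileSystem per user, reused for the life of this query plan.
  private final Map<String, DrillFileSystem> fsByUser = new HashMap<>();
  private final Configuration fsConf;

  public QueryScopedFileSystemCache(Configuration fsConf) {
    this.fsConf = fsConf;
  }

  public DrillFileSystem get(String userName) {
    return fsByUser.computeIfAbsent(userName,
        u -> ImpersonationUtil.createFileSystem(u, fsConf));
  }
}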

@@ -229,7 +229,7 @@ public DrillbitContext getDrillbitContext() {
return context;
}

public SchemaPlus getRootSchema() {
public SchemaPlus getFullRootSchema() {
Contributor

Comment to explain what a "full root schema" is? Apparently, this is both the plugin config name and workspace combined?

@chunhui-shi chunhui-shi force-pushed the DRILL-5089-pull branch 2 times, most recently from 9a86032 to 80e702f on November 16, 2017 04:47
Member

@arina-ielchiieva arina-ielchiieva left a comment

@chunhui-shi thanks for making the changes. Please see my comments below...

String use_dfs = "use dfs.tmp";
client.queryBuilder().sql(use_dfs).run();
String sql = "SELECT id_i, name_s10 FROM `mock_good`.`employees_5`";
client.queryBuilder().sql(sql).printCsv();
Member

Do we actually want to print CSV here? I suggest we produce no output.

try {
client.queryBuilder().sql(sql).printCsv();
} catch (Exception ex) {
assertTrue(ex.getMessage().contains("VALIDATION ERROR: Schema"));
Member

This test can give a false positive when no exception is thrown at all. Please re-throw the exception after the check and add @Test(expected = Exception.class).
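A sketch of the requested pattern (the test name and the mock_broken schema are assumptions; client is the fixture used in the snippets above). Re-throwing plus expected = Exception.class makes the test fail when no exception is raised, instead of passing silently:

@Test(expected = Exception.class)
public void testQueryOnBrokenStorageFails() throws Exception {
  String sql = "SELECT id_i, name_s10 FROM `mock_broken`.`employees_5`";
  try {
    client.queryBuilder().sql(sql).run();
  } catch (Exception ex) {
    // Verify it is the expected validation failure, then let it
    // propagate so that expected = Exception.class is satisfied.
    assertTrue(ex.getMessage().contains("VALIDATION ERROR: Schema"));
    throw ex;
  }
}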

}
}

@AfterClass
Member

Can be removed.


public class MockBreakageStorage extends MockStorageEngine {

boolean breakRegister;
Member

private

* In this case, we will still use method listStatus.
* In other cases, we use access method since it is cheaper.
*/
if (SystemUtils.IS_OS_WINDOWS && fs.getUri().getScheme().equalsIgnoreCase(FileSystemSchemaFactory.LOCAL_FS_SCHEME)) {
Member

Just in case, did you check that everything works on Windows?

Contributor Author

Yes, it was tested in Windows unit tests.


public FileSystemSchemaFactory(String schemaName, List<WorkspaceSchemaFactory> factories) {
super();
// when the corresponding FileSystemPlugin is not passed in, we dig into ANY workspace factory to get it.
if (factories.size() > 0 ) {
Member

Please remove space if (factories.size() > 0) {.

}
return Compatible.INSTANCE.navigableSet(
ImmutableSortedSet.copyOf(
Sets.union(pluginNames, getSubSchemaMap().keySet())));
Member

Could you please explain what this method actually returns? According to its name it should return table names, but it seems to return different things...


// we could find storage plugin for first part(e.g. 'dfs') of schemaName (e.g. 'dfs.tmp')
// register schema for this storage plugin to 'this'.
plugin.registerSchemas(schemaConfig, thisPlus);
Member

Can we get an NPE here? Let's say that after the split on line 97 we got a length of 1 or 3?

Contributor Author

We get here only when that split produced an array of length 2, and plugin is checked for null before this call, so plugin on this line should not be null, if that is your concern.
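The control flow being described, reduced to a sketch assembled from the hunks quoted above (surrounding code elided):

String[] paths = schemaName.split("\\.");
if (paths.length == 2) {                    // only 2-part names proceed
  StoragePlugin plugin = getSchemaFactories().getPlugin(paths[0]);
  if (plugin != null) {                     // null-checked before use
    // found a storage plugin for the first part (e.g. 'dfs') of the
    // schema name (e.g. 'dfs.tmp'); register its schema under 'this'.
    plugin.registerSchemas(schemaConfig, thisPlus);
  }
}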

}

static class RootSchema extends AbstractSchema {
@Override public Expression getExpression(SchemaPlus parentSchema,
Member

Could you please explain why we override getExpression method?

Contributor Author

This is copied from the RootSchema used in SimpleCalciteSchema, which is not a public class. getExpression is used in Calcite code, not in our code.

super(parentSchemaPath, wsName);
this.schemaConfig = schemaConfig;
this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
this.fs = fs;
Member

Why do we no longer need to create fs using ImpersonationUtil, when we needed it before?

Contributor Author

Now we pass in fs instead of creating it inside WorkspaceSchema.

Member

@arina-ielchiieva arina-ielchiieva left a comment

@chunhui-shi could you please resolve the conflicts and reply about NPE asked in previous code review round?

@chunhui-shi chunhui-shi force-pushed the DRILL-5089-pull branch 2 times, most recently from c096a82 to a0ed588 on November 20, 2017 20:18
DRILL-5089: Dynamically load schema of storage plugin only when needed for every query

For each query, loading all storage plugins and all workspaces under file system plugins is not needed.

This patch uses DynamicRootSchema as the root schema for Drill, which loads the corresponding storage plugin only when needed.

INFORMATION_SCHEMA reads the full schema information and loads second-level schemas accordingly.

For workspaces under the same FileSystem, there is no need to create a FileSystem per workspace.

Use the fs.access API to check permissions; it is available as of HDFS 2.6, except for the Windows + local file system case.

Add unit tests using a broken mock storage plugin: with a storage plugin that throws an exception in its registerSchemas method, all queries, even those on good storage plugins, would fail without this fix (Drill would still load all schemas from all storage plugins).
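A sketch of that broken mock storage plugin, based on the MockBreakageStorage hunk quoted earlier (the constructor, package, and exact signatures are assumptions):

import java.io.IOException;
import org.apache.calcite.schema.SchemaPlus;

public class MockBreakageStorage extends MockStorageEngine {

  private boolean breakRegister;

  public MockBreakageStorage(MockStorageEngineConfig configuration,
      DrillbitContext context, String name) {
    super(configuration, context, name);
    breakRegister = false;
  }

  public void setBreakRegister(boolean breakRegister) {
    this.breakRegister = breakRegister;
  }

  @Override
  public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent)
      throws IOException {
    if (breakRegister) {
      // Simulate a storage plugin whose schema registration always fails.
      throw new IOException("mock breakRegister!");
    }
    super.registerSchemas(schemaConfig, parent);
  }
}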
@arina-ielchiieva
Member

+1

@asfgit asfgit closed this in d4c61ca Nov 22, 2017
chunhui-shi added a commit to chunhui-shi/drill that referenced this pull request Dec 22, 2017
ilooner pushed a commit to ilooner/drill that referenced this pull request Jan 30, 2018