DRILL-5089: Dynamically load schema of storage plugin only when neede… #1032
Conversation
Did a superficial code-level review. @arina-ielchiieva, please look at the Calcite aspects.
}

retSchema = getSubSchemaMap().get(schemaName);
return retSchema;
If the original call returns non-null, we make the same call a second time. Better:
retSchema = ...
if (retSchema != null) { return retSchema; }
loadSchemaFactory(...)
return getSubSchemaMap()...
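The suggested restructuring can be sketched as a self-contained example. The `Map`-backed cache and the `loadSchemaFactory` body here are illustrative stand-ins, not Drill's actual schema machinery; the point is that a cache hit returns immediately and the expensive load runs only on a miss:

```java
import java.util.HashMap;
import java.util.Map;

class LazySchemaLookup {
  private final Map<String, String> subSchemaMap = new HashMap<>();
  int loads = 0; // counts how often the expensive path runs

  String getSubSchema(String schemaName) {
    String retSchema = subSchemaMap.get(schemaName);
    if (retSchema != null) {
      return retSchema; // cache hit: no second lookup, no load
    }
    loadSchemaFactory(schemaName); // expensive path, taken at most once per name
    return subSchemaMap.get(schemaName);
  }

  // Stand-in for loading a storage plugin's schema factory.
  private void loadSchemaFactory(String schemaName) {
    loads++;
    subSchemaMap.put(schemaName, "schema:" + schemaName);
  }

  public static void main(String[] args) {
    LazySchemaLookup lookup = new LazySchemaLookup();
    System.out.println(lookup.getSubSchema("dfs"));
    System.out.println(lookup.getSubSchema("dfs")); // second call is a cache hit
    System.out.println(lookup.loads);               // expensive load ran only once
  }
}
```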
@Override
public NavigableSet<String> getTableNames() {
  Set<String> pluginNames = Sets.newHashSet();
Should this be case-insensitive?
Plugin names in Drill are case sensitive.
StoragePlugin plugin = getSchemaFactories().getPlugin(schemaName);
if (plugin != null) {
  plugin.registerSchemas(schemaConfig, thisPlus);
}
If the name is `dfs.test`, we first look up the compound name, then the parts? Why? Do we put the compound names in the map? Or can we have one schema named "dfs.test" and another called `dfs`.`test`? Or, can this code be restructured a bit?
}
else {
  //this schema name could be `dfs.tmp`, a 2nd level schema under 'dfs'
  String[] paths = schemaName.split("\\.");
Should this be done here in this simple way? How many other places do we do the same thing? Or, should we have a common function to split schema names, so we can handle, say, escapes and other special cases that might come along?
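A common splitting function along the lines the reviewer suggests might look like this hypothetical helper. The backtick-escape handling is an illustrative assumption (Drill quotes identifiers with backticks), not Drill's actual parsing code:

```java
import java.util.ArrayList;
import java.util.List;

class SchemaNames {
  // Split a dotted schema name into parts, treating dots inside
  // backtick-quoted segments as literal characters rather than separators.
  static List<String> split(String schemaName) {
    List<String> parts = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean quoted = false;
    for (char c : schemaName.toCharArray()) {
      if (c == '`') {
        quoted = !quoted;              // toggle quoting, drop the backtick
      } else if (c == '.' && !quoted) {
        parts.add(current.toString()); // unquoted dot ends a part
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    parts.add(current.toString());
    return parts;
  }

  public static void main(String[] args) {
    System.out.println(split("dfs.tmp"));    // two parts
    System.out.println(split("`dfs.test`")); // one part: the dot is quoted
  }
}
```

Centralizing the split would let a `schemaName.split("\\.")` call site and the compound-name lookup discussed above share one set of escaping rules.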
else {
  //this schema name could be `dfs.tmp`, a 2nd level schema under 'dfs'
  String[] paths = schemaName.split("\\.");
  if (paths.length == 2) {
Do we support only 1- and 2-part names? Should we assert that the length <= 2?
 * In this case, we will still use method listStatus.
 * In other cases, we use access method since it is cheaper.
 */
if (SystemUtils.IS_OS_WINDOWS && fs.getUri().getScheme().equalsIgnoreCase("file")) {
HDFS probably defines a constant for "file". Should we reference that?
FileSystem in HDFS has a constant DEFAULT_FS ("file:///"); for now I will define our own.
@@ -175,6 +193,21 @@ public WorkspaceSchema createSchema(List<String> parentSchemaPath, SchemaConfig
  return new WorkspaceSchema(parentSchemaPath, schemaName, schemaConfig);
}

public WorkspaceSchema createSchema(List<String> parentSchemaPath, SchemaConfig schemaConfig, DrillFileSystem fs) throws IOException {
  if (!accessible(fs)) {
Is returning null sufficient to tell the user that they don't have permission to do this operation?
Returning null means the user cannot even list this workspace, so they don't know of its existence at all. I think that is good access-control practice.
If users expect to see a workspace but cannot, they need to figure out why by themselves.
if (this.fs == null) {
  this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
}
return this.fs;
No need for `this.fs`; just `fs` will do.
if (this.fs == null) {
  this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
}
return this.fs;
This class caches the file system, which is good. The other classes in this PR do not; they create the fs as needed.
Does Calcite allow some kind of session state in which we can cache the fs for the query (plan) rather than creating it on the fly each time we need it?
@@ -229,7 +229,7 @@ public DrillbitContext getDrillbitContext() {
  return context;
}

- public SchemaPlus getRootSchema() {
+ public SchemaPlus getFullRootSchema() {
Comment to explain what a "full root schema" is? Apparently, this is both the plugin config name and workspace combined?
@chunhui-shi thanks for making the changes. Please see my comments below...
String use_dfs = "use dfs.tmp";
client.queryBuilder().sql(use_dfs).run();
String sql = "SELECT id_i, name_s10 FROM `mock_good`.`employees_5`";
client.queryBuilder().sql(sql).printCsv();
Do we actually want to print csv here? I suggest we produce no output here.
try {
  client.queryBuilder().sql(sql).printCsv();
} catch (Exception ex) {
  assertTrue(ex.getMessage().contains("VALIDATION ERROR: Schema"));
This test can give a false positive result when the exception is not thrown at all. Please re-throw the exception after the check and add @Test(expected = Exception.class).
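The false positive and the suggested fix can be sketched in plain Java. The names below are stand-ins, not the actual Drill test harness; in real JUnit code the re-thrown exception is what `@Test(expected = Exception.class)` catches, so the test fails when nothing is thrown:

```java
class FalsePositiveDemo {
  // Stand-in for running the query; throws only when the schema is broken.
  static void runQuery(boolean broken) throws Exception {
    if (broken) throw new Exception("VALIDATION ERROR: Schema [bad] not found");
  }

  // Flawed pattern: the assertion lives only in the catch block,
  // so the method "passes" silently when no exception is thrown.
  static boolean flawedCheck(boolean broken) {
    try {
      runQuery(broken);
    } catch (Exception ex) {
      return ex.getMessage().contains("VALIDATION ERROR: Schema");
    }
    return true; // reached when nothing was thrown: a false positive
  }

  // Fixed pattern: verify the message, then re-throw. If nothing is
  // thrown the method returns normally, and @Test(expected = ...) fails.
  static void fixedCheck(boolean broken) throws Exception {
    try {
      runQuery(broken);
    } catch (Exception ex) {
      if (!ex.getMessage().contains("VALIDATION ERROR: Schema")) {
        throw new AssertionError("unexpected message");
      }
      throw ex;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(flawedCheck(false)); // "passes" despite no exception
    boolean threw = false;
    try { fixedCheck(false); } catch (Exception e) { threw = true; }
    System.out.println(threw); // normal return is what JUnit would flag
  }
}
```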
  }
}

@AfterClass
Can be removed.
public class MockBreakageStorage extends MockStorageEngine {

  boolean breakRegister;
private
 * In this case, we will still use method listStatus.
 * In other cases, we use access method since it is cheaper.
 */
if (SystemUtils.IS_OS_WINDOWS && fs.getUri().getScheme().equalsIgnoreCase(FileSystemSchemaFactory.LOCAL_FS_SCHEME)) {
Just in case, did you check that everything works on Windows?
Yes, it was tested in Windows unit tests.
public FileSystemSchemaFactory(String schemaName, List<WorkspaceSchemaFactory> factories) {
  super();
  // when the correspondent FileSystemPlugin is not passed in, we dig into ANY workspace factory to get it.
  if (factories.size() > 0 ) {
Please remove the extra space: if (factories.size() > 0) {.
}
return Compatible.INSTANCE.navigableSet(
    ImmutableSortedSet.copyOf(
        Sets.union(pluginNames, getSubSchemaMap().keySet())));
Could you please explain what this method actually returns? According to its name it should return table names, but it seems to return something different...
// we could find storage plugin for first part (e.g. 'dfs') of schemaName (e.g. 'dfs.tmp')
// register schema for this storage plugin to 'this'.
plugin.registerSchemas(schemaConfig, thisPlus);
Can we get an NPE here? Let's say that after the split on line 97 we got an array of length 1 or 3?
We get to this place only when the split produced an array of length 2, and only after we check the plugin for null, so 'plugin' on this line should not be null, if that is your concern.
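The guard structure being described can be sketched as follows (illustrative names, not Drill's actual classes): the `registerSchemas` stand-in is reached only behind both the length check and the null check, so neither a 1- or 3-part name nor an unknown plugin can cause an NPE there.

```java
import java.util.HashMap;
import java.util.Map;

class TwoPartLookup {
  static Map<String, String> plugins = new HashMap<>();

  static String resolve(String schemaName) {
    String[] paths = schemaName.split("\\.");
    if (paths.length == 2) {                 // guard 1: exactly two parts
      String plugin = plugins.get(paths[0]);
      if (plugin != null) {                  // guard 2: plugin exists
        return plugin + "/" + paths[1];      // stand-in for registerSchemas
      }
    }
    return null; // 1- or 3-part names, or unknown plugins, end up here
  }

  public static void main(String[] args) {
    plugins.put("dfs", "dfsPlugin");
    System.out.println(resolve("dfs.tmp"));   // both guards pass
    System.out.println(resolve("a.b.c"));     // length 3: guard 1 fails
    System.out.println(resolve("bogus.tmp")); // unknown plugin: guard 2 fails
  }
}
```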
}

static class RootSchema extends AbstractSchema {
  @Override public Expression getExpression(SchemaPlus parentSchema,
Could you please explain why we override the getExpression method?
This is copied from the RootSchema used in SimpleCalciteSchema, which class is not public. getExpression is used in Calcite code, not in our code.
  super(parentSchemaPath, wsName);
  this.schemaConfig = schemaConfig;
- this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
+ this.fs = fs;
Why don't we need to create fs using ImpersonationUtil anymore, when it was needed before?
Now we pass in fs instead of creating it inside WorkspaceSchema.
@chunhui-shi could you please resolve the conflicts and reply to the NPE question asked in the previous code review round?
…d for every query For each query, loading all storage plugins and all workspaces under file system plugins is not needed. This patch uses DynamicRootSchema as the root schema for Drill, which loads the corresponding storage plugin only when needed. INFORMATION_SCHEMA reads full schema information and loads second-level schemas accordingly. For workspaces under the same FileSystem, there is no need to create a FileSystem for each workspace. The patch uses the fs.access API to check permissions, which is available since HDFS 2.6, except for the Windows + local file system case. Unit tests are added with a broken mock storage plugin: with a plugin that throws an exception in its registerSchemas method, without this fix all queries, even on good storage plugins, would fail (Drill still loaded all schemas from all storage plugins).
+1