
DRILL-5089: Dynamically load schema of storage plugin only when needed for every query #1032

Closed
wants to merge 1 commit

Conversation

chunhui-shi
Contributor

DRILL-5089: Dynamically load schema of storage plugin only when needed for every query

For each query, loading all storage plugins and all workspaces under file system plugins is not needed.

This patch uses DynamicRootSchema as the root schema for Drill, which loads the corresponding storage plugin only when needed.

INFORMATION_SCHEMA reads the full schema information and loads second-level schemas accordingly.

For workspaces under the same FileSystem, there is no need to create a FileSystem per workspace.

Use the fs.access API to check permissions; it is available as of HDFS 2.6, except for the Windows + local file system case.
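For context, here is a minimal sketch of the access-based permission check that last point describes, using Hadoop's FileSystem.access (available since Hadoop 2.6). The accessible helper and the chosen FsAction are illustrative assumptions, not Drill's exact code:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.security.AccessControlException;

class WorkspaceAccessCheck {
  /** Returns true if the current user may list the workspace root.
   *  fs.access() asks the file system to check permissions directly,
   *  which is cheaper than fetching statuses via listStatus(). */
  static boolean accessible(FileSystem fs, Path wsRoot) throws IOException {
    try {
      fs.access(wsRoot, FsAction.READ_EXECUTE);
      return true;
    } catch (AccessControlException e) {
      return false;   // workspace stays hidden from this user
    }
  }
}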

Contributor

@paul-rogers paul-rogers left a comment


Did a superficial code-level review. @arina-ielchiieva, please look at the Calcite aspects.

}

retSchema = getSubSchemaMap().get(schemaName);
return retSchema;
Contributor

If the original call returns non-null, we make the same call a second time. Better:

retSchema = ...
if (retSchema != null) { return retSchema; }
loadSchemaFactory(...)
return getSubSchemaMap()...
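A sketch of that suggestion (the method shape is simplified; the real method in this patch has a different signature and return type):

Schema getSubSchema(String schemaName) {
  // Try the cached sub-schema map first.
  Schema retSchema = getSubSchemaMap().get(schemaName);
  if (retSchema != null) {
    return retSchema;                 // found: no second lookup needed
  }
  loadSchemaFactory(schemaName);      // register only this plugin's schema
  return getSubSchemaMap().get(schemaName);
}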


@Override
public NavigableSet<String> getTableNames() {
Set<String> pluginNames = Sets.newHashSet();
Contributor

Should this be case-insensitive?

Contributor Author

Plugin names in Drill are case-sensitive.

StoragePlugin plugin = getSchemaFactories().getPlugin(schemaName);
if (plugin != null) {
plugin.registerSchemas(schemaConfig, thisPlus);
}
Contributor

If the name is dfs.test, we first look up the compound name, then the parts? Why? Do we put the compound names in the map? Or can we have one schema literally named "dfs.test" and another that is the test schema nested under dfs? Or can this code be restructured a bit?

}
else {
//this schema name could be `dfs.tmp`, a 2nd level schema under 'dfs'
String[] paths = schemaName.split("\\.");
Contributor

Should this be done here in this simple way? How many other places do we do the same thing? Or should we have a common function to split schema names, so we can handle, say, escapes and other special cases that might come along?
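A sketch of the kind of shared helper being suggested here (the class and method names are hypothetical, not an existing Drill utility):

import java.util.Arrays;
import java.util.List;

public final class SchemaNameUtils {
  private SchemaNameUtils() {}

  /** Splits a compound schema name such as "dfs.tmp" into its parts.
   *  Today this is the same simple split the diff does inline; a shared
   *  helper gives escapes and other special cases one place to live. */
  public static List<String> splitSchemaPath(String schemaName) {
    return Arrays.asList(schemaName.split("\\."));
  }
}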

else {
//this schema name could be `dfs.tmp`, a 2nd level schema under 'dfs'
String[] paths = schemaName.split("\\.");
if (paths.length == 2) {
Contributor

Do we support only 1- and 2-part names? Should we assert that the length <= 2?

* In this case, we will still use method listStatus.
* In other cases, we use access method since it is cheaper.
*/
if (SystemUtils.IS_OS_WINDOWS && fs.getUri().getScheme().equalsIgnoreCase("file")) {
Contributor

HDFS probably defines a constant for "file". Should we reference that?

Contributor Author

FileSystem in HDFS has a constant DEFAULT_FS ("file:///"); for now I will define our own.
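A later revision of this hunk (quoted further below) references FileSystemSchemaFactory.LOCAL_FS_SCHEME. A sketch of that constant and the check; the helper method name and the constant's exact value are assumptions consistent with the equalsIgnoreCase("file") test above:

import org.apache.commons.lang3.SystemUtils;
import org.apache.hadoop.fs.FileSystem;

public class FileSystemSchemaFactory {
  public static final String LOCAL_FS_SCHEME = "file";   // assumed value

  /** fs.access() is not reliable for the local file system on Windows,
   *  so that combination falls back to the costlier listStatus() check. */
  static boolean needsListStatusFallback(FileSystem fs) {
    return SystemUtils.IS_OS_WINDOWS
        && fs.getUri().getScheme().equalsIgnoreCase(LOCAL_FS_SCHEME);
  }
}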

@@ -175,6 +193,21 @@ public WorkspaceSchema createSchema(List<String> parentSchemaPath, SchemaConfig
return new WorkspaceSchema(parentSchemaPath, schemaName, schemaConfig);
}

public WorkspaceSchema createSchema(List<String> parentSchemaPath, SchemaConfig schemaConfig, DrillFileSystem fs) throws IOException {
if (!accessible(fs)) {
Contributor

Is returning null sufficient to tell the user that they don't have permission to do this operation?

Contributor Author

Returning null means the user cannot even list this workspace, so they do not know that it exists at all. I think that is good access-control practice.

If users expect to see a workspace but cannot, they need to figure out why by themselves.

if (this.fs == null) {
this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
}
return this.fs;
Contributor

No need for this.fs, just fs will do.

if (this.fs == null) {
this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
}
return this.fs;
Contributor

This class caches the file system, which is good. The other classes in this PR do not; they create the fs as needed.

Does Calcite allow some kind of session state in which we can cache the fs for the query (plan) rather than creating it on the fly each time we need it?
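One hypothetical way to get per-query caching, independent of whatever hook Calcite may offer, is to hang a small cache off an object whose lifetime matches the plan. Everything below (the class name, where it would live) is an assumption for illustration only:

import java.util.HashMap;
import java.util.Map;
import org.apache.drill.exec.store.dfs.DrillFileSystem;
import org.apache.drill.exec.util.ImpersonationUtil;
import org.apache.hadoop.conf.Configuration;

public class QueryScopedFileSystemCache {
  // One DrillFileSystem per user, reused for the life of this query plan.
  private final Map<String, DrillFileSystem> fsByUser = new HashMap<>();
  private final Configuration fsConf;

  public QueryScopedFileSystemCache(Configuration fsConf) {
    this.fsConf = fsConf;
  }

  public DrillFileSystem get(String userName) {
    return fsByUser.computeIfAbsent(userName,
        u -> ImpersonationUtil.createFileSystem(u, fsConf));
  }
}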

@@ -229,7 +229,7 @@ public DrillbitContext getDrillbitContext() {
return context;
}

public SchemaPlus getRootSchema() {
public SchemaPlus getFullRootSchema() {
Contributor

Comment to explain what a "full root schema" is? Apparently, this is both the plugin config name and workspace combined?

@chunhui-shi chunhui-shi force-pushed the DRILL-5089-pull branch 2 times, most recently from 9a86032 to 80e702f on November 16, 2017 04:47
Member

@arina-ielchiieva arina-ielchiieva left a comment

@chunhui-shi thanks for making the changes. Please see my comments below...

String use_dfs = "use dfs.tmp";
client.queryBuilder().sql(use_dfs).run();
String sql = "SELECT id_i, name_s10 FROM `mock_good`.`employees_5`";
client.queryBuilder().sql(sql).printCsv();
Member

Do we actually want to print CSV here? I suggest we produce no output.

try {
client.queryBuilder().sql(sql).printCsv();
} catch (Exception ex) {
assertTrue(ex.getMessage().contains("VALIDATION ERROR: Schema"));
Member

This test can give a false positive when no exception is thrown at all. Please re-throw the exception after the check and add @Test(expected = Exception.class).
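A sketch of the requested pattern (the test name and the mock_broken schema are assumptions; client is the fixture used in the snippets above). Re-throwing plus expected = Exception.class makes the test fail when no exception is raised, instead of passing silently:

@Test(expected = Exception.class)
public void testQueryOnBrokenStorageFails() throws Exception {
  String sql = "SELECT id_i, name_s10 FROM `mock_broken`.`employees_5`";
  try {
    client.queryBuilder().sql(sql).run();
  } catch (Exception ex) {
    // Verify it is the expected validation failure, then let it
    // propagate so that expected = Exception.class is satisfied.
    assertTrue(ex.getMessage().contains("VALIDATION ERROR: Schema"));
    throw ex;
  }
}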

}
}

@AfterClass
Member

Can be removed.


public class MockBreakageStorage extends MockStorageEngine {

boolean breakRegister;
Member

private

* In this case, we will still use method listStatus.
* In other cases, we use access method since it is cheaper.
*/
if (SystemUtils.IS_OS_WINDOWS && fs.getUri().getScheme().equalsIgnoreCase(FileSystemSchemaFactory.LOCAL_FS_SCHEME)) {
Member

Just in case, did you check that everything works on Windows?

Contributor Author

Yes, it was tested in Windows unit tests.


public FileSystemSchemaFactory(String schemaName, List<WorkspaceSchemaFactory> factories) {
super();
// when the corresponding FileSystemPlugin is not passed in, we dig into ANY workspace factory to get it.
if (factories.size() > 0 ) {
Member

Please remove space if (factories.size() > 0) {.

}
return Compatible.INSTANCE.navigableSet(
ImmutableSortedSet.copyOf(
Sets.union(pluginNames, getSubSchemaMap().keySet())));
Member

Could you please explain what this method actually returns? According to its name it should return table names, but it seems to return different things...


// we could find storage plugin for first part(e.g. 'dfs') of schemaName (e.g. 'dfs.tmp')
// register schema for this storage plugin to 'this'.
plugin.registerSchemas(schemaConfig, thisPlus);
Member

Can we get an NPE here? Let's say that after the split on line 97 we got a length of 1 or 3?

Contributor Author

We get here only when that split produced an array of length 2, and plugin is checked for null before this call, so plugin on this line should not be null, if that is your concern.
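The control flow being described, reduced to a sketch assembled from the hunks quoted above (surrounding code elided):

String[] paths = schemaName.split("\\.");
if (paths.length == 2) {                    // only 2-part names proceed
  StoragePlugin plugin = getSchemaFactories().getPlugin(paths[0]);
  if (plugin != null) {                     // null-checked before use
    // found a storage plugin for the first part (e.g. 'dfs') of the
    // schema name (e.g. 'dfs.tmp'); register its schema under 'this'.
    plugin.registerSchemas(schemaConfig, thisPlus);
  }
}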

}

static class RootSchema extends AbstractSchema {
@Override public Expression getExpression(SchemaPlus parentSchema,
Member

Could you please explain why we override getExpression method?

Contributor Author

This is copied from the RootSchema used in SimpleCalciteSchema, which is not a public class. getExpression is used in Calcite code, not in our code.

super(parentSchemaPath, wsName);
this.schemaConfig = schemaConfig;
this.fs = ImpersonationUtil.createFileSystem(schemaConfig.getUserName(), fsConf);
this.fs = fs;
Member

Why do we no longer need to create fs using ImpersonationUtil, when we needed it before?

Contributor Author

Now we pass in fs instead of creating it inside WorkspaceSchema.

Member

@arina-ielchiieva arina-ielchiieva left a comment

@chunhui-shi could you please resolve the conflicts and reply about NPE asked in previous code review round?

@chunhui-shi chunhui-shi force-pushed the DRILL-5089-pull branch 2 times, most recently from c096a82 to a0ed588 on November 20, 2017 20:18
DRILL-5089: Dynamically load schema of storage plugin only when needed for every query

For each query, loading all storage plugins and all workspaces under file system plugins is not needed.

This patch uses DynamicRootSchema as the root schema for Drill, which loads the corresponding storage plugin only when needed.

INFORMATION_SCHEMA reads the full schema information and loads second-level schemas accordingly.

For workspaces under the same FileSystem, there is no need to create a FileSystem per workspace.

Use the fs.access API to check permissions; it is available as of HDFS 2.6, except for the Windows + local file system case.

Add unit tests using a broken mock storage plugin: with a storage plugin that throws an exception in its registerSchemas method, all queries, even those on good storage plugins, would fail without this fix (Drill would still load all schemas from all storage plugins).
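A sketch of that broken mock storage plugin, based on the MockBreakageStorage hunk quoted earlier (the constructor, package, and exact signatures are assumptions):

import java.io.IOException;
import org.apache.calcite.schema.SchemaPlus;

public class MockBreakageStorage extends MockStorageEngine {

  private boolean breakRegister;

  public MockBreakageStorage(MockStorageEngineConfig configuration,
      DrillbitContext context, String name) {
    super(configuration, context, name);
    breakRegister = false;
  }

  public void setBreakRegister(boolean breakRegister) {
    this.breakRegister = breakRegister;
  }

  @Override
  public void registerSchemas(SchemaConfig schemaConfig, SchemaPlus parent)
      throws IOException {
    if (breakRegister) {
      // Simulate a storage plugin whose schema registration always fails.
      throw new IOException("mock breakRegister!");
    }
    super.registerSchemas(schemaConfig, parent);
  }
}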
@arina-ielchiieva
Member

+1

@asfgit asfgit closed this in d4c61ca Nov 22, 2017
chunhui-shi added a commit to chunhui-shi/drill that referenced this pull request Dec 22, 2017
ilooner pushed a commit to ilooner/drill that referenced this pull request Jan 30, 2018