
DRILL-6454: Native MapR DB plugin support for Hive MapR-DB json table#1314

Closed
vdiravka wants to merge 2 commits into apache:master from vdiravka:DRILL-6454

Conversation

@vdiravka
Member

@vdiravka vdiravka commented Jun 9, 2018

  • A new StoragePluginOptimizerRule for reading MapR-DB JSON tables with the Drill native reader is added.
    During the planning stage, the rule allows replacing a HiveScan with the Drill native JsonTableGroupScan.
  • A new system/session option is added to enable the new rule.
  • The common part between ConvertHiveMapRDBJsonScanToDrillMapRDBJsonScan and ConvertHiveParquetScanToDrillParquetScan is factored out into HiveUtilities.
  • A dependency on drill-format-mapr is added for hive-core.

Note:

  1. The option name for the parquet native reader is changed.
  2. The changes in MapRDBFormatMatcher.java are just refactoring.
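Conceptually, the new rule is a pattern-match-and-substitute step in the planner. The sketch below is a simplified, dependency-free illustration of that idea; the `Scan`, `HiveScan`, and `JsonTableGroupScan` classes here are placeholder stand-ins, not the real Drill or Calcite types (the actual rule is a StoragePluginOptimizerRule operating on Calcite RelNodes).

```java
import java.util.Optional;

public class NativeReaderRuleSketch {

  interface Scan {}

  // Placeholder for a Hive-provided scan of some table format.
  static class HiveScan implements Scan {
    final String tableFormat;
    HiveScan(String tableFormat) { this.tableFormat = tableFormat; }
  }

  // Placeholder for Drill's native MapR-DB JSON group scan.
  static class JsonTableGroupScan implements Scan {}

  // Rewrite a Hive scan of a MapR-DB JSON table into the native scan,
  // but only when the session option enabling the rule is set.
  static Optional<Scan> tryRewrite(Scan scan, boolean optionEnabled) {
    if (optionEnabled
        && scan instanceof HiveScan
        && "maprdb-json".equals(((HiveScan) scan).tableFormat)) {
      return Optional.of(new JsonTableGroupScan());
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    Scan rewritten = tryRewrite(new HiveScan("maprdb-json"), true).orElse(null);
    System.out.println(rewritten instanceof JsonTableGroupScan); // prints "true"
  }
}
```

When the option is off, or the scan is not a MapR-DB JSON table, the rule simply does not fire and the original HiveScan is kept.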

@vdiravka vdiravka self-assigned this Jun 11, 2018
@vdiravka
Member Author

@gparai @vrozov Could you please review this PR?

@gparai

gparai commented Jun 11, 2018

@vdiravka do you have any write-up/design spec for this feature?

@vdiravka
Member Author

@gparai I have described the approach in the Jira description and the changes in the PR description. There is no need for a design document for this feature, since it is too small.

@priteshm

@gparai can you please complete the review soon?

import org.apache.drill.exec.store.mapr.db.binary.MapRDBBinaryTable;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
Member

avoid import order changes.

Member Author

Done. I just moved com.mapr.fs.MapRFileStatus; it should be near com.mapr.fs.tables.TableProperties.

security.admin.users: "%drill_process_user%",
store.format: "parquet",
store.hive.optimize_scan_with_native_readers: false,
store.hive.parquet.optimize_scan_with_native_readers: false,
Member

Is it necessary to maintain backward compatibility with prior versions?

Member Author

Changing the name is necessary: earlier there was only one native reader, and controlling all native readers with a single property is not possible.

So I have decided to keep the old property name as deprecated for backward compatibility and to leave a comment that it should be removed starting from the 1.15.0 release.

The new property name is also introduced, so either of them can enable the parquet native reader.

}

public Collection<InputSplit> getInputSplits() {
public List<InputSplit> getInputSplits() {
Member

Why Collection->List? Where order comes from?

Member

It is not a question of whether it is known to be a List or an ArrayList (it is probably known that it is an ArrayList). List provides certain behavior (allows duplicates and provides an insertion-order guarantee). Do you need any of those?

Member Author

A small but fair remark. Thanks. I have reverted this to Collection; there is no need to use List.

I think Collection is used because Map#values() returns it in some places.
From the code it looks like the Collection API is enough (at least for now) for logicalInputSplit.
I also found other places with Collection<FileSplit> logicalInputSplit and List<LogicalInputSplit> getInputSplits(); possibly they should be brought to a common style.
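The point about Collection being sufficient can be illustrated with a minimal sketch (class and field names here are hypothetical, not from the Drill code): a getter typed as Collection can expose Map#values() directly, without copying into a List, and still serves callers that only iterate.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class SplitsByPath {
  private final Map<String, String> splits = new HashMap<>();

  void add(String path, String split) { splits.put(path, split); }

  // Collection is the weakest type that still supports iteration and size(),
  // and it lets us return Map#values() as-is, without an extra copy.
  Collection<String> getInputSplits() { return splits.values(); }

  public static void main(String[] args) {
    SplitsByPath s = new SplitsByPath();
    s.add("/a", "split-a");
    s.add("/b", "split-b");
    System.out.println(s.getInputSplits().size()); // prints "2"
  }
}
```

Had the getter promised List, the implementation would be forced to build a new List on every call just to satisfy the signature.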

</exclusion>
</exclusions>
</dependency>
<dependency>
Member

Where is the dependency introduced?

Member Author

This is the contrib/format-maprdb module from the default profile, therefore it is used here in the default profile too.
Classes from this dependency are used in ConvertHiveMapRDBJsonScanToDrillMapRDBJsonScan.java.

Member

OK, I see. I was confused by the Drill practice of using the same package name in different modules/jars. This practice is not common and is far from best practice.

Member Author

Could you clarify what package name would be better to use?
Do you mean short names instead of full ones for the common packages?

Member

No, I don't refer to using short names. I refer to using the "org.apache.drill.exec" package name outside of "exec". As it is already widely used in Drill, I don't suggest changing it in this PR.

public static final String HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS = "store.hive.optimize_scan_with_native_readers";
public static final OptionValidator HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS_VALIDATOR =
new BooleanValidator(HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS);
public static final String HIVE_OPTIMIZE_PARQUET_SCAN_WITH_NATIVE_READER = "store.hive.parquet.optimize_scan_with_native_readers";
@gparai gparai Jun 19, 2018

Please keep the naming consistent.

Member Author

The old name is not appropriate once several native readers exist for the Hive plugin.
But I will keep the old property name as deprecated for backward compatibility and will leave a comment that it should be removed starting from the 1.15.0 release.

The new property name is also introduced, so either of them can enable the parquet native reader.
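The described behavior, where both property names enable the same reader, might look roughly like the sketch below. Only the two option-name strings come from the PR; the lookup logic is a hedged illustration, not the actual Drill OptionManager code.

```java
import java.util.HashMap;
import java.util.Map;

public class ParquetReaderOptions {

  @Deprecated // kept for backward compatibility; planned for removal in 1.15.0
  static final String OLD_KEY = "store.hive.optimize_scan_with_native_readers";
  static final String NEW_KEY = "store.hive.parquet.optimize_scan_with_native_readers";

  // Either property enables the parquet native reader, so sessions that
  // still set only the deprecated name keep working.
  static boolean parquetNativeReaderEnabled(Map<String, Boolean> options) {
    return Boolean.TRUE.equals(options.get(OLD_KEY))
        || Boolean.TRUE.equals(options.get(NEW_KEY));
  }

  public static void main(String[] args) {
    Map<String, Boolean> session = new HashMap<>();
    session.put(OLD_KEY, true); // only the deprecated name is set
    System.out.println(parquetNativeReaderEnabled(session)); // prints "true"
  }
}
```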

@gparai

gparai commented Jun 20, 2018

The rest of the changes LGTM.

Member Author

@vdiravka vdiravka left a comment

@gparai @vrozov I have addressed your comments and have made changes.

Member

@arina-ielchiieva arina-ielchiieva left a comment

Overall looks good, a few minor comments though.

}

/**
* This method allows to cehck whether the Hive Table contains
Member

check


// TODO: We need to add a feature that enables storage plugins to add their own options. Currently we have to declare
// in core which is not right. Move this option and above two mongo plugin related options once we have the feature.
@Deprecated // it should be removed starting from next Drill 1.15.0 release
Member

Please file Jira for this and add in comment.

// in core which is not right. Move this option and above two mongo plugin related options once we have the feature.
@Deprecated // it should be removed starting from next Drill 1.15.0 release
public static final String HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS = "store.hive.optimize_scan_with_native_readers";
@Deprecated // it should be removed starting from next Drill 1.15.0 release
Member

Please file Jira for this and add in comment.

}

/**
* Rule is matched when all of the following match:
Member

Please use formatting.

Member Author

Done. Thanks for catching this.

// we need to read path of only one to get file path
assert iterator.hasNext();
InputSplit split = iterator.next();
InputSplit split = logicalInputSplit.getInputSplits().get(0);
Member

No, this needs to be reverted. I have left the assert deliberately.

*/
private SchemaPath replaceOverriddenSchemaPath(Map<String, String> parameters, SchemaPath colNameSchemaPath) {
String hiveColumnName = colNameSchemaPath.getRootSegmentPath();
if (hiveColumnName.equals(parameters.get(MAPRDB_COLUMN_ID))) {
Member

ternary operator?

* @return original column name
*/
private String replaceOverriddenColumnId(Map<String, String> parameters, String colName) {
return Objects.equals(colName, parameters.get(MAPRDB_COLUMN_ID)) ? ID_KEY : colName;
Member

Objects.equals will consider two nulls equal; will this case be OK for your changes?
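The null-handling in question is the standard java.util.Objects semantics: Objects.equals(null, null) returns true, so if both colName and the MAPRDB_COLUMN_ID parameter were null, the method above would return ID_KEY. Whether that matters depends on whether a null column name can ever reach this code. A quick demonstration of the stock behavior:

```java
import java.util.Objects;

public class ObjectsEqualsDemo {
  public static void main(String[] args) {
    // Objects.equals is null-safe: two nulls compare equal,
    // and a non-null value never equals null.
    System.out.println(Objects.equals(null, null));   // prints "true"
    System.out.println(Objects.equals("_id", null));  // prints "false"
    System.out.println(Objects.equals("_id", "_id")); // prints "true"
  }
}
```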

Member Author

@vdiravka vdiravka left a comment

@arina Thanks for catching those points.
I have addressed them and filed a Jira ticket for removing the old parquet native reader option name:
https://issues.apache.org/jira/browse/DRILL-6527

}

/**
* Rule is matched when all of the following match:
Member Author

Done. Thanks for catching this.

@arina-ielchiieva
Member

+1, LGTM.

vdiravka added a commit to vdiravka/drill that referenced this pull request Jun 22, 2018
@gparai

gparai commented Jun 22, 2018

@vdiravka thanks for making the changes. +1 LGTM.

vdiravka added a commit to vdiravka/drill that referenced this pull request Jun 22, 2018