DRILL-6454: Native MapR DB plugin support for Hive MapR-DB json table #1314
vdiravka wants to merge 2 commits into apache:master from
Conversation
@vdiravka do you have any writeup/design spec for this feature?

@gparai I have described the approach in the Jira description and the changes in the PR description. There is no need for a design document for this feature, since it is too small.

@gparai can you please complete the review soon?
import org.apache.drill.exec.store.mapr.db.binary.MapRDBBinaryTable;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

Done. I just moved com.mapr.fs.MapRFileStatus; it should be near com.mapr.fs.tables.TableProperties.
  security.admin.users: "%drill_process_user%",
  store.format: "parquet",
-  store.hive.optimize_scan_with_native_readers: false,
+  store.hive.parquet.optimize_scan_with_native_readers: false,

Is it necessary to maintain backward compatibility with prior versions?

Changing the name is necessary: earlier there was only one native reader, and it is not possible to control all native readers with one property.
So I have decided to keep the old property name as deprecated for backward compatibility and to leave a comment that it should be removed starting from the 1.15.0 release.
The new property name is also introduced, so both of them can enable the Parquet native reader.
}

-  public Collection<InputSplit> getInputSplits() {
+  public List<InputSplit> getInputSplits() {

Why Collection -> List? Where does the order come from?

There is no need to use Collection. We know exactly that it is a list.

It is not a question of whether it is known to be a List or an ArrayList (it is probably known to be an ArrayList). List provides certain behavior: it allows duplicates and guarantees insertion order. Do you need either of those?

A small but fair remark, thanks. I have returned this to Collection; there is no need to use List.
I think Collection is used because Map#values() returns it in some places.
From the code it looks like the Collection API is enough (at least for now) for logicalInputSplit.
I also found other places with Collection<FileSplit> logicalInputSplit and List<LogicalInputSplit> getInputSplits(); possibly they should be brought to a common style.
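To illustrate the point being agreed on here, a minimal sketch (hypothetical class and field names, not the PR's actual code) of why a Map-backed getter can return Collection:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Map#values() already returns a Collection, so a getter that merely exposes
// it does not need to promise List semantics such as index access or an
// insertion-order guarantee.
class SplitHolder {
  private final Map<String, String> splitsByPath = new HashMap<>();

  // Callers only iterate over the splits, so Collection is a sufficient contract.
  Collection<String> getSplits() {
    return splitsByPath.values(); // no defensive copy, no stronger promise than needed
  }
}
```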
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>

Where is the dependency introduced?

This is the contrib/format-maprdb module from the default profile, therefore it is used here in the default profile too.
Sources from this dependency are used in ConvertHiveMapRDBJsonScanToDrillMapRDBJsonScan.java.

OK, I see. I was confused by the Drill practice of using the same package name in different modules/jars. This practice is not common and is far from best practice.

Could you clarify what package name would be better to use?
Do you mean short names instead of full ones for the common packages?

No, I am not referring to short names. I am referring to using the "org.apache.drill.exec" package name outside of "exec". As it is already widely used in Drill, I don't suggest changing it in this PR.
public static final String HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS = "store.hive.optimize_scan_with_native_readers";
public static final OptionValidator HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS_VALIDATOR =
    new BooleanValidator(HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS);
public static final String HIVE_OPTIMIZE_PARQUET_SCAN_WITH_NATIVE_READER = "store.hive.parquet.optimize_scan_with_native_readers";

The old name is no longer right once the Hive plugin has several native readers.
But I will keep the old property name as deprecated for backward compatibility and will leave a comment that it should be removed starting from the 1.15.0 release.
The new property name is also introduced, so both of them can enable the Parquet native reader.
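As a minimal sketch of that coexistence (the two property names are the real ones from the diff; the enclosing class and the helper method are hypothetical):

```java
import java.util.function.Predicate;

class HiveReaderOptions {
  @Deprecated // kept only for backward compatibility, slated for removal
  static final String OLD_NAME = "store.hive.optimize_scan_with_native_readers";
  static final String NEW_NAME = "store.hive.parquet.optimize_scan_with_native_readers";

  // The Parquet native reader is enabled if either property is set to true.
  static boolean parquetNativeReaderEnabled(Predicate<String> isOptionTrue) {
    return isOptionTrue.test(OLD_NAME) || isOptionTrue.test(NEW_NAME);
  }
}
```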
The rest of the changes LGTM.
arina-ielchiieva left a comment
Overall looks good, a few minor comments though.
}

/**
 * This method allows to check whether the Hive table contains
// TODO: We need to add a feature that enables storage plugins to add their own options. Currently we have to declare
// them in core, which is not right. Move this option and the two mongo plugin related options above once we have the feature.
@Deprecated // it should be removed starting from Drill 1.15.0

Please file a Jira for this and add it in the comment.
// them in core, which is not right. Move this option and the two mongo plugin related options above once we have the feature.
@Deprecated // it should be removed starting from Drill 1.15.0
public static final String HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS = "store.hive.optimize_scan_with_native_readers";
@Deprecated // it should be removed starting from Drill 1.15.0

Please file a Jira for this and add it in the comment.
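A sketch of what the requested comment could look like, assuming DRILL-6527 (the ticket filed later in this review) is the one to reference; the holder class is hypothetical:

```java
class ExecConstantsFragment {
  @Deprecated // DRILL-6527: remove this option name in a future release
  static final String HIVE_OPTIMIZE_SCAN_WITH_NATIVE_READERS =
      "store.hive.optimize_scan_with_native_readers";
}
```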
}

/**
 * Rule is matched when all of the following match:

Please use formatting.

Done. Thanks for catching this.
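A hedged sketch of the kind of Javadoc formatting being requested; the list items are illustrative, taken from the PR description rather than the exact conditions in the code:

```java
/**
 * Rule is matched when all of the following match:
 * <ul>
 *   <li>the scan is a Hive scan over a MapR-DB JSON table</li>
 *   <li>the native MapR-DB JSON reader option is enabled</li>
 * </ul>
 */
class ConvertScanRuleSketch { } // hypothetical placeholder
```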
-  // we need to read path of only one to get file path
-  assert iterator.hasNext();
-  InputSplit split = iterator.next();
+  InputSplit split = logicalInputSplit.getInputSplits().get(0);

No, this needs to be reverted: I left the assert there deliberately.
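For clarity, a simplified sketch (hypothetical names and types) of the assert-based version the comment asks to keep:

```java
import java.util.Iterator;
import java.util.List;

class FirstSplitSketch {
  // The assert documents the invariant that at least one split exists
  // before the first one is read, instead of silently indexing with get(0).
  static String firstSplitPath(List<String> splits) {
    Iterator<String> iterator = splits.iterator();
    // we need to read the path of only one split to get the file path
    assert iterator.hasNext();
    return iterator.next();
  }
}
```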
 */
private SchemaPath replaceOverriddenSchemaPath(Map<String, String> parameters, SchemaPath colNameSchemaPath) {
  String hiveColumnName = colNameSchemaPath.getRootSegmentPath();
  if (hiveColumnName.equals(parameters.get(MAPRDB_COLUMN_ID))) {

 * @return original column name
 */
private String replaceOverriddenColumnId(Map<String, String> parameters, String colName) {
  return Objects.equals(colName, parameters.get(MAPRDB_COLUMN_ID)) ? ID_KEY : colName;

Objects.equals will consider two nulls equal; will this case be OK for your changes?
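A self-contained demo of the null semantics behind this question:

```java
import java.util.Objects;

// Objects.equals(a, b) is true when both arguments are null and false
// (without throwing) when exactly one of them is null.
public class ObjectsEqualsDemo {
  public static void main(String[] args) {
    System.out.println(Objects.equals(null, null));  // true
    System.out.println(Objects.equals("_id", null)); // false
    System.out.println(Objects.equals(null, "_id")); // false
    // So in replaceOverriddenColumnId a null colName together with a missing
    // MAPRDB_COLUMN_ID parameter would compare equal and be mapped to ID_KEY.
  }
}
```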
vdiravka left a comment

@arina Thanks for catching those points.
I have addressed them and filed a Jira ticket for removing the old Parquet native reader option name:
https://issues.apache.org/jira/browse/DRILL-6527
+1, LGTM.

@vdiravka thanks for making the changes. +1 LGTM.
A StoragePluginOptimizerRule for reading MapR-DB JSON tables with the Drill native reader is added. The rule allows replacing HiveScan with the Drill native JsonTableGroupScan at the planning stage.
Common code of ConvertHiveMapRDBJsonScanToDrillMapRDBJsonScan and ConvertHiveParquetScanToDrillParquetScan is factored out into HiveUtilities.
A drill-format-mapr dependency for hive-core is added.
Note:
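For readers unfamiliar with planner rules, a rough skeleton of how such a rule is shaped; everything below is a simplified illustration on top of Calcite's RelOptRule, not the PR's actual ConvertHiveMapRDBJsonScanToDrillMapRDBJsonScan code:

```java
import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.core.TableScan;

class ConvertHiveScanSketch extends RelOptRule {
  ConvertHiveScanSketch() {
    // Fire on any table scan; the real rule matches the Hive scan node.
    super(operand(TableScan.class, any()), "ConvertHiveScanSketch");
  }

  @Override
  public boolean matches(RelOptRuleCall call) {
    // Real rule: check that the scan is a Hive scan over a MapR-DB JSON
    // table and that the native reader option is enabled (details elided).
    return false; // placeholder
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    // Real rule: build the native JsonTableGroupScan-based scan and register
    // it with the planner, e.g. call.transformTo(nativeScan);
  }
}
```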