HAWQ-1446: Introduce vectorized profile for ORC. #1225
Conversation
Force-pushed from 41c1a6a to 6a063e8
    isVectorizedResolver = ArrayUtils.contains(Class.forName(inputData.getResolver()).getInterfaces(), ReadVectorizedResolver.class);
} catch (ClassNotFoundException e) {
    LOG.error("Unable to load resolver class: " + e.getMessage());
    return false;
No need for this line; the function will return the default false value at the end anyway.
Sure, thanks
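The point above can be sketched as follows. This is an illustrative, simplified version of the resolver check, not the actual PXF source; the class and method names are hypothetical:

```java
// Sketch: the "return false" in the catch block is redundant, because
// execution falls through to the final return of the default value.
// Names here are hypothetical stand-ins for the PXF code under review.
public class VectorizedCheck {
    public static boolean isVectorizedResolver(String resolverClassName) {
        boolean isVectorized = false; // default value
        try {
            Class<?> clazz = Class.forName(resolverClassName);
            for (Class<?> iface : clazz.getInterfaces()) {
                if ("ReadVectorizedResolver".equals(iface.getSimpleName())) {
                    isVectorized = true;
                }
            }
        } catch (ClassNotFoundException e) {
            // just log; no explicit "return false" needed here,
            // the method returns the default at the end
            System.err.println("Unable to load resolver class: " + e.getMessage());
        }
        return isVectorized;
    }
}
```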
@@ -289,7 +289,7 @@ private void fetchMetaData(HiveTablePartition tablePartition, boolean hasComplex
  if (inputData.getProfile() != null) {
      // evaluate optimal profile based on file format if profile was explicitly specified in url
      // if user passed accessor+fragmenter+resolver - use them
-     profile = ProfileFactory.get(fformat, hasComplexTypes);
+     profile = ProfileFactory.get(fformat, hasComplexTypes, inputData.getProfile());
getProfile() is called twice (in the if statement and here); it's better to call it once, then evaluate and reuse the variable.
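The "call once, reuse" pattern can be sketched like this. ProfileLookup and its methods are hypothetical stand-ins, not the actual PXF classes:

```java
public class ProfileLookup {
    // Hypothetical stand-in for InputData.getProfile()
    static String getProfile(String stored) {
        return stored;
    }

    static String choose(String stored, String fallback) {
        // Call getProfile() once and reuse the local variable,
        // instead of calling it both in the condition and in the body.
        String profileName = getProfile(stored);
        if (profileName != null) {
            return profileName;
        }
        return fallback;
    }
}
```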
@@ -136,7 +136,7 @@ public HiveMetadataFetcher(InputData md) {
  private OutputFormat getOutputFormat(String inputFormat, boolean hasComplexTypes) throws Exception {
      OutputFormat outputFormat = null;
      InputFormat<?, ?> fformat = HiveDataFragmenter.makeInputFormat(inputFormat, jobConf);
-     String profile = ProfileFactory.get(fformat, hasComplexTypes);
+     String profile = ProfileFactory.get(fformat, hasComplexTypes, null);
Passing explicit null params should be avoided; if possible, overload the function when more/fewer params are desired.
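The overloading idea can be sketched as follows. ProfileFactorySketch is a hypothetical, simplified stand-in for the real ProfileFactory; the profile-selection logic shown is illustrative only:

```java
public class ProfileFactorySketch {
    // Overload for callers that have no requested profile:
    // they use the two-argument form instead of passing an explicit null.
    static String get(String format, boolean hasComplexTypes) {
        return get(format, hasComplexTypes, null);
    }

    static String get(String format, boolean hasComplexTypes, String requestedProfile) {
        if (requestedProfile != null) {
            return requestedProfile; // honor an explicitly requested profile
        }
        // illustrative fallback logic, not the real selection rules
        return hasComplexTypes ? "Hive" : "Hive" + format;
    }
}
```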
}

/**
 * This method updated reader optionst to include projected columns only.
typo "optionst"
Thanks, fixed
 * One batch is 1024 rows of all projected columns
 *
 */
public class HiveORCBatchAccessor extends Plugin implements ReadAccessor {
Would it be useful if it extended HiveORCAccessor and overrode functions?
Sure, updated
/* process all rows from current batch for given column */
for (int rowIndex = 0; rowIndex < vectorizedBatch.size; rowIndex++) {
    if (vectorizedBatch.cols[columnIndex] != null && !vectorizedBatch.cols[columnIndex].isNull[rowIndex]) {
        writableObject = vectorizedBatch.cols[columnIndex].getWritableObject(rowIndex);
vectorizedBatch.cols[columnIndex] is used 3 times; substitute it with a variable?
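The hoisting suggested above can be sketched like this. The Column type here is a minimal stand-in for Hive's ColumnVector, not the real class:

```java
public class ColumnLoop {
    // Minimal stand-in for a vectorized column (illustrative only)
    static class Column {
        boolean[] isNull;
        int[] values;
        Column(boolean[] isNull, int[] values) {
            this.isNull = isNull;
            this.values = values;
        }
    }

    static int sumColumn(Column[] cols, int columnIndex, int size) {
        int sum = 0;
        // Hoist the repeated cols[columnIndex] lookup into one local variable
        Column col = cols[columnIndex];
        for (int rowIndex = 0; rowIndex < size; rowIndex++) {
            if (col != null && !col.isNull[rowIndex]) {
                sum += col.values[rowIndex];
            }
        }
        return sum;
    }
}
```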
@@ -149,9 +149,10 @@ public static ReadAccessor getFileAccessor(InputData inputData)
          inputData.getAccessor(), inputData);
  }

+ @SuppressWarnings("unchecked")
  public static ReadResolver getFieldsResolver(InputData inputData)
Ouch, can you make Utilities.createAnyInstance templatized (generic) instead?
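A generic version might look like the sketch below, which moves the cast into the utility so call sites no longer need @SuppressWarnings. This is a hypothetical signature, not the actual PXF Utilities class:

```java
public class Utilities {
    // Hypothetical generic createAnyInstance: the caller supplies the
    // expected type, and the (checked) cast happens once, here.
    static <T> T createAnyInstance(Class<T> type, String className) {
        try {
            Class<?> clazz = Class.forName(className);
            return type.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Unable to create instance of " + className, e);
        }
    }
}
```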
public class ReadVectorizedBridge implements Bridge {

    ReadAccessor fileAccessor = null;
Not private members?
if (batch == null) {
    output = outputBuilder.getPartialLine();
    if (output != null) {
        //LOG.warn("A partial record in the end of the fragment");
Remove commented-out lines?
@@ -101,6 +101,17 @@ under the License.
      <outputFormat>org.apache.hawq.pxf.service.io.GPDBWritable</outputFormat>
    </plugins>
  </profile>
+ <profile>
+   <name>HiveVectorizedORC</name>
Seems like "batch" and "vectorized" are used interchangeably; should we use just one term?
Renamed all classes to use "vectorized"
PrimitiveObjectInspector poi = (PrimitiveObjectInspector) oi;
resolvePrimitiveColumn(columnIndex, oi, vectorizedBatch);
} else {
    throw new UnsupportedTypeException("Unable to resolve column index:" + columnIndex + ". Only primitive types are supported.");
Can't we catch this error upfront, by checking whether the schema has any non-primitive type column when we use BatchResolver, instead?
This is the very first place we start to process columns; I don't think we should double-validate them.
OneField field = null;
Writable writableObject = null;
/* process all rows from current batch for given column */
for (int rowIndex = 0; rowIndex < vectorizedBatch.size; rowIndex++) {
Since we know that all the writableObjects in a column are of the same type, why don't we simply extract all the values in the column into a Writable[] and resolve the entire array at once, using just one switch case to determine the field type and resolving each writable value in one go?
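The idea of one type dispatch per column rather than per row can be sketched like this. The types below are simplified stand-ins, not Hive's Writable hierarchy, and resolving to strings is purely illustrative:

```java
public class ColumnResolver {
    enum FieldType { INT, STRING }

    // One switch on the column type, then resolve the whole value array
    // in a single pass; the per-row work no longer re-inspects the type.
    static String[] resolveColumn(FieldType type, Object[] columnValues) {
        String[] out = new String[columnValues.length];
        switch (type) { // evaluated once per column, not once per row
            case INT:
                for (int row = 0; row < columnValues.length; row++) {
                    out[row] = columnValues[row] == null
                            ? "NULL" : String.valueOf((Integer) columnValues[row]);
                }
                break;
            case STRING:
                for (int row = 0; row < columnValues.length; row++) {
                    out[row] = columnValues[row] == null
                            ? "NULL" : (String) columnValues[row];
                }
                break;
        }
        return out;
    }
}
```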
@@ -0,0 +1,126 @@
package org.apache.hawq.pxf.service;
ReadVectorizedBridge looks very similar to ReadBridge except for getNext() function. Please refactor both classes to avoid duplication
Makes sense, extended.
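The shape of the agreed refactor can be sketched as follows. The class and method names are illustrative stand-ins, not the actual PXF bridge classes:

```java
// Sketch: the vectorized bridge extends the base bridge and overrides
// only the logic that differs (getNext), so the shared code lives in
// one place. Names are hypothetical, not the real PXF classes.
public class BridgeSketch {
    static class ReadBridge {
        public String getNext() {
            return "one-row"; // base behavior: one record at a time
        }
        public boolean beginIteration() {
            return true; // shared setup lives in the base class
        }
    }

    static class ReadVectorizedBridge extends ReadBridge {
        @Override
        public String getNext() {
            return "one-batch"; // only the differing logic is overridden
        }
    }
}
```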
Overall looks good. You can address the type cast optimization and the class refactor as an incremental update to the vectorized ORC profile.
Force-pushed from 11e9436 to 53c40b7
@@ -34,31 +34,31 @@
  public void get() throws Exception {

      // For TextInputFormat when table has no complex types, HiveText profile should be used
-     String profileName = ProfileFactory.get(new TextInputFormat(), false);
+     String profileName = ProfileFactory.get(new TextInputFormat(), false, null);
Can revert these changes now that the two-argument function is back, right?
Sure
@@ -137,6 +137,18 @@ public Writable getErrorOutput(Exception ex) throws Exception {
      return outputList;
  }

+ public LinkedList<Writable> makeVectorizedOutput(List<List<OneField>> recordsBatch) throws BadRecordException {
+     outputList.clear();
+     for (List<OneField> record : recordsBatch) {
No null checks necessary for recordsBatch and record?
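The null checks being asked about could look like the sketch below, with List&lt;String&gt; standing in for List&lt;List&lt;OneField&gt;&gt; so the example stays self-contained; the upper-casing is a placeholder for the real serialization:

```java
public class OutputBuilder {
    // Illustrative null-guarded version of makeVectorizedOutput
    static java.util.LinkedList<String> makeVectorizedOutput(java.util.List<String> recordsBatch) {
        java.util.LinkedList<String> outputList = new java.util.LinkedList<>();
        if (recordsBatch == null) { // guard against a null batch
            return outputList;
        }
        for (String record : recordsBatch) {
            if (record == null) {   // skip null records rather than NPE later
                continue;
            }
            outputList.add(record.toUpperCase()); // stand-in for real serialization
        }
        return outputList;
    }
}
```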
Force-pushed from a2d7fa5 to 79c688e
Work still in progress, want to get earlier feedback.