
HAWQ-1446: Introduce vectorized profile for ORC. #1225

Closed
wants to merge 2 commits

Conversation

avocader (Contributor)

Work still in progress, looking for early feedback.

@avocader force-pushed the HAWQ-1446 branch 2 times, most recently from 41c1a6a to 6a063e8 on May 1, 2017 at 20:56
isVectorizedResolver = ArrayUtils.contains(Class.forName(inputData.getResolver()).getInterfaces(), ReadVectorizedResolver.class);
} catch (ClassNotFoundException e) {
LOG.error("Unable to load resolver class: " + e.getMessage());
return false;
Contributor

No need for this line; the function will return the default false value at the end anyway.

Contributor Author

Sure, thanks
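
For illustration, the simplified shape might look like the sketch below; the helper's name and the surrounding method are assumptions, only the try/catch body comes from the diff:

    private static boolean isResolverVectorized(InputData inputData) {
        boolean isVectorizedResolver = false;
        try {
            isVectorizedResolver = ArrayUtils.contains(
                    Class.forName(inputData.getResolver()).getInterfaces(),
                    ReadVectorizedResolver.class);
        } catch (ClassNotFoundException e) {
            LOG.error("Unable to load resolver class: " + e.getMessage());
            // no early return needed; falls through to the default below
        }
        return isVectorizedResolver;
    }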

@@ -289,7 +289,7 @@ private void fetchMetaData(HiveTablePartition tablePartition, boolean hasComplex
if (inputData.getProfile() != null) {
// evaluate optimal profile based on file format if profile was explicitly specified in url
// if user passed accessor+fragmenter+resolver - use them
profile = ProfileFactory.get(fformat, hasComplexTypes);
profile = ProfileFactory.get(fformat, hasComplexTypes, inputData.getProfile());
Contributor

getProfile() is called twice (in the if statement and here); it's better to call it once and reuse the variable.
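
A sketch of the suggested reuse (the local variable name is an assumption; the surrounding lines are approximated from the diff):

    String requestedProfile = inputData.getProfile();
    if (requestedProfile != null) {
        // profile was explicitly specified in the url: evaluate the optimal
        // profile for the file format exactly once
        profile = ProfileFactory.get(fformat, hasComplexTypes, requestedProfile);
    }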

@@ -136,7 +136,7 @@ public HiveMetadataFetcher(InputData md) {
private OutputFormat getOutputFormat(String inputFormat, boolean hasComplexTypes) throws Exception {
OutputFormat outputFormat = null;
InputFormat<?, ?> fformat = HiveDataFragmenter.makeInputFormat(inputFormat, jobConf);
String profile = ProfileFactory.get(fformat, hasComplexTypes);
String profile = ProfileFactory.get(fformat, hasComplexTypes, null);
Contributor

Passing explicit null params should be avoided; if possible, overload the function when more/fewer params are desired.
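
One way to honor this, assuming ProfileFactory.get currently takes the three arguments shown in the diff, is a two-argument overload that supplies the null internally:

    // hypothetical overload so callers never pass an explicit null profile
    public static String get(InputFormat<?, ?> inputFormat, boolean hasComplexTypes) {
        return get(inputFormat, hasComplexTypes, null);
    }

This would also let two-argument call sites, like the tests further down, stay unchanged.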

}

/**
* This method updated reader optionst to include projected columns only.
Contributor

typo "optionst"

Contributor Author

Thanks, fixed

* One batch is 1024 rows of all projected columns
*
*/
public class HiveORCBatchAccessor extends Plugin implements ReadAccessor {
Contributor

Would it be useful if it extended HiveORCAccessor and overrode functions?

Contributor Author

Sure, updated
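
A rough sketch of that inheritance; the reader field, the exact ORC batch API (which varies by Hive version), and the constructor contract are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.hawq.pxf.api.OneRow;
    import org.apache.hawq.pxf.api.utilities.InputData;

    public class HiveORCVectorizedAccessor extends HiveORCAccessor {

        // assumed: initialized from the parent's ORC Reader during open
        private org.apache.hadoop.hive.ql.io.orc.RecordReader vectorizedReader;
        private VectorizedRowBatch batch;

        public HiveORCVectorizedAccessor(InputData input) throws Exception {
            super(input);
        }

        @Override
        public OneRow readNextObject() throws IOException {
            batch = vectorizedReader.nextBatch(batch);
            if (batch == null || batch.size == 0) {
                return null; // fragment exhausted
            }
            return new OneRow(null, batch); // hand the whole batch to the resolver
        }
    }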

/* process all rows from current batch for given column */
for (int rowIndex = 0; rowIndex < vectorizedBatch.size; rowIndex++) {
if (vectorizedBatch.cols[columnIndex] != null && !vectorizedBatch.cols[columnIndex].isNull[rowIndex]) {
writableObject = vectorizedBatch.cols[columnIndex].getWritableObject(rowIndex);
Contributor

vectorizedBatch.cols[columnIndex] is used 3 times; substitute it with a variable?
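
The extraction would look roughly like this; only the local variable is new, the rest is the excerpt above:

    ColumnVector column = vectorizedBatch.cols[columnIndex];
    /* process all rows from current batch for given column */
    for (int rowIndex = 0; rowIndex < vectorizedBatch.size; rowIndex++) {
        if (column != null && !column.isNull[rowIndex]) {
            writableObject = column.getWritableObject(rowIndex);
            // ... resolve writableObject as before
        }
    }

Since column does not change inside the loop, its null check could even be hoisted above it.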

@@ -149,9 +149,10 @@ public static ReadAccessor getFileAccessor(InputData inputData)
inputData.getAccessor(), inputData);
}

public static ReadResolver getFieldsResolver(InputData inputData)
@SuppressWarnings("unchecked")
Contributor

Ouch, can you make Utilities.createAnyInstance templatized (generic) instead?
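
A generified variant might look like the sketch below; the current signature and the single-InputData-argument constructor contract are assumptions:

    // hypothetical generic factory; the unchecked cast disappears from call sites
    public static <T> T createAnyInstance(Class<T> expectedType, String className,
                                          InputData metaData) throws Exception {
        Class<?> cls = Class.forName(className);
        if (!expectedType.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(
                    className + " does not implement " + expectedType.getName());
        }
        java.lang.reflect.Constructor<?> con = cls.getConstructor(InputData.class);
        return expectedType.cast(con.newInstance(metaData));
    }

Callers could then write ReadResolver resolver = createAnyInstance(ReadResolver.class, inputData.getResolver(), inputData) with no cast and no @SuppressWarnings.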


public class ReadVectorizedBridge implements Bridge {

ReadAccessor fileAccessor = null;
Contributor

Not private members?

if (batch == null) {
output = outputBuilder.getPartialLine();
if (output != null) {
//LOG.warn("A partial record in the end of the fragment");
Contributor

Remove commented lines?

@@ -101,6 +101,17 @@ under the License.
<outputFormat>org.apache.hawq.pxf.service.io.GPDBWritable</outputFormat>
</plugins>
</profile>
<profile>
<name>HiveVectorizedORC</name>
Contributor

seems like "batch" and "vectorized" are used interchangeably, should we use just one term ?

Contributor Author

Renamed all classes to use "vectorized"

PrimitiveObjectInspector poi = (PrimitiveObjectInspector ) oi;
resolvePrimitiveColumn(columnIndex, oi, vectorizedBatch);
} else {
throw new UnsupportedTypeException("Unable to resolve column index:" + columnIndex + ". Only primitive types are supported.");
Contributor

Can't we catch this error upfront by checking whether the schema has any non-primitive type column when we use the BatchResolver, instead?

Contributor Author

This is the very first place we start to process columns; I don't think we should double-validate them.

OneField field = null;
Writable writableObject = null;
/* process all rows from current batch for given column */
for (int rowIndex = 0; rowIndex < vectorizedBatch.size; rowIndex++) {
Contributor

Since we know that all the writableObjects are of the same type, why don't we simply extract all values in the column into a Writable[] and resolve the entire array at once: use just one switch case to determine the field type, then resolve each writable value in one go?
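
For illustration, a column-at-a-time version of the reviewer's proposal; the fields accumulator and the DataType mapping are assumptions:

    // gather the column's values once, then dispatch on the type once
    ColumnVector column = vectorizedBatch.cols[columnIndex];
    Writable[] values = new Writable[vectorizedBatch.size];
    for (int rowIndex = 0; rowIndex < vectorizedBatch.size; rowIndex++) {
        values[rowIndex] = (column != null && !column.isNull[rowIndex])
                ? column.getWritableObject(rowIndex)
                : null;
    }
    switch (poi.getPrimitiveCategory()) {
        case LONG:
            for (Writable w : values) {
                fields.add(new OneField(DataType.BIGINT.getOID(),
                        w == null ? null : ((LongWritable) w).get()));
            }
            break;
        // ... one case per supported primitive category
        default:
            throw new UnsupportedTypeException(
                    "Unsupported primitive category: " + poi.getPrimitiveCategory());
    }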

@@ -0,0 +1,126 @@
package org.apache.hawq.pxf.service;
Contributor

ReadVectorizedBridge looks very similar to ReadBridge except for the getNext() function. Please refactor both classes to avoid duplication.

Contributor Author

Makes sense, extended.
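
The extended shape could be sketched like this; the protected fields inherited from ReadBridge and the ReadVectorizedResolver method name are assumptions:

    import java.util.List;
    import org.apache.hadoop.io.Writable;
    import org.apache.hawq.pxf.api.OneField;
    import org.apache.hawq.pxf.api.OneRow;
    import org.apache.hawq.pxf.service.utilities.ProtocolData;

    // hypothetical subclass: everything except getNext() is inherited
    public class ReadVectorizedBridge extends ReadBridge {

        public ReadVectorizedBridge(ProtocolData protData) throws Exception {
            super(protData);
        }

        @Override
        public Writable getNext() throws Exception {
            if (!outputQueue.isEmpty()) {
                return outputQueue.pop(); // drain output from the previous batch
            }
            OneRow batch = fileAccessor.readNextObject();
            if (batch == null) {
                return null; // fragment exhausted
            }
            List<List<OneField>> records =
                    ((ReadVectorizedResolver) fieldsResolver).getFieldsForBatch(batch);
            outputQueue = outputBuilder.makeVectorizedOutput(records);
            return outputQueue.isEmpty() ? null : outputQueue.pop();
        }
    }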

@shivzone (Contributor) left a comment

Overall looks good. You can address the type cast optimization and the class refactor as an incremental update to the vectorized ORC profile.

@avocader force-pushed the HAWQ-1446 branch 8 times, most recently from 11e9436 to 53c40b7 on May 26, 2017 at 21:54
@@ -34,31 +34,31 @@
public void get() throws Exception {

// For TextInputFormat when table has no complex types, HiveText profile should be used
String profileName = ProfileFactory.get(new TextInputFormat(), false);
String profileName = ProfileFactory.get(new TextInputFormat(), false, null);
Contributor

Can revert these changes now that the function with 2 arguments is back, right?

Contributor Author

Sure

@@ -137,6 +137,18 @@ public Writable getErrorOutput(Exception ex) throws Exception {
return outputList;
}

public LinkedList<Writable> makeVectorizedOutput(List<List<OneField>> recordsBatch) throws BadRecordException {
outputList.clear();
for (List<OneField> record : recordsBatch) {
Contributor

No null checks necessary for recordsBatch and record?
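
If the inputs can legitimately be null, a defensive sketch (whether to skip or throw on a null record is an assumption):

    public LinkedList<Writable> makeVectorizedOutput(List<List<OneField>> recordsBatch)
            throws BadRecordException {
        outputList.clear();
        if (recordsBatch == null) {
            return outputList; // nothing to serialize
        }
        for (List<OneField> record : recordsBatch) {
            if (record == null) {
                continue; // skip holes rather than NPE mid-batch
            }
            // ... serialize record exactly as before
        }
        return outputList;
    }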
