Skip to content

HIVE-15112: Implement Parquet vectorization reader for Struct type#116

Closed
winningsix wants to merge 8 commits intoapache:masterfrom
winningsix:HIVE-15112-new
Closed

HIVE-15112: Implement Parquet vectorization reader for Struct type#116
winningsix wants to merge 8 commits intoapache:masterfrom
winningsix:HIVE-15112-new

Conversation

@winningsix
Copy link

Refactor UT

@@ -0,0 +1,13 @@
package org.apache.hadoop.hive.ql.io.parquet.vector;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

License header?


import java.io.IOException;

public interface VectorizedParquetColumnReader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some comment on the readBatch method? Also, from the method signature it seems it should not be restricted to only Parquet. How about VectorizedColumnReader?


import java.io.IOException;

public class VectorizedParquetMapReader implements VectorizedParquetColumnReader{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary? this is the same as VectorizedParquetColumnReader.

int total,
ColumnVector column,
TypeInfo columnType) throws IOException {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra line

List<ColumnDescriptor> columns) {
List<ColumnDescriptor> res = new ArrayList<>();
for (ColumnDescriptor descriptor : columns) {
if (type.getName().equals(descriptor.getPath()[depth])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the path length is smaller than depth? Will this crash?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happens only when schema is corrupted. Addressing this by adding a check before this if block.

fieldReaders.add(r);
}
}
if (fieldReaders.size() > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if fieldReaders.size() is not equal to fieldTypes.size(). Can this be handled?

PrimitiveTypeInfo primitiveColumnType = (PrimitiveTypeInfo) columnType;
readBatchForPrimitiveType(num, column, primitiveColumnType, rowId);
break;
case LIST:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a primitive column reader, why it should read complex types?

@winningsix winningsix changed the title HIVE-15112: Implement Parquet vectorization reader for Complex types HIVE-15112: Implement Parquet vectorization reader for Struct type Nov 30, 2016
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.hive.ql.io.parquet.vector;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unused imports.

case INTERVAL_DAY_TIME:
case TIMESTAMP:
default:
throw new IOException("Unsupported");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to improve this message, e.g., include the specific type involved.

readFloats(num, (DoubleColumnVector) column, rowId);
break;
case DECIMAL:
readDecimal(num, (DecimalColumnVector) column, rowId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO break?

List<ColumnDescriptor> columns) throws ParquetRuntimeException {
List<ColumnDescriptor> res = new ArrayList<>();
for (ColumnDescriptor descriptor : columns) {
if (depth > descriptor.getPath().length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>= ?

MessageType schema,
boolean skipTimestampConversion) throws IOException {
return buildVectorizedParquetReader(typeInfo, type, pages, schema.getColumns(), skipTimestampConversion,
0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we put this into the same line above? for easier reading.

columnReaders[i] =
new VectorizedColumnReader(columns.get(i), pages.getPageReader(columns.get(i)),
skipTimestampConversion, types.get(i));
buildVectorizedParquetReader(columnTypesList.get(indexColumnsWanted.get(i)), types.get(i),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that indexColumnsWanted could be empty?


public class VectorizedStructReader implements VectorizedColumnReader {

List<VectorizedColumnReader> fieldReaders;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this private?

fieldReaders.get(i)
.readBatch(total, vectors[i], structTypeInfo.getAllStructFieldTypeInfos().get(i));
structColumnVector.isRepeating = structColumnVector.isRepeating && vectors[i].isRepeating;
for (int j = 0; j < vectors[i].isNull.length; j++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a difference between null struct versus struct with null fields. Seems this treat the two cases as the same. Do we need to differentiate them?

fs.delete(file, true);
}
}
public class TestVectorizedColumnReader extends TestVectorizedColumnReaderBase{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before {

reader.close();
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we talked about testing reading decimal. Should we add in this patch?

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @winningsix ! This PR looks good to me. Just a few minor comments before checking it in.

import java.io.IOException;
import java.util.List;

public class VectorizedStructReader implements VectorizedColumnReader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename this to VectorizedStructColumnReader? to be consistent with VectorizedColumnReader and VectorizedPrimitiveColumnReader.


public class VectorizedStructReader implements VectorizedColumnReader {

private List<VectorizedColumnReader> fieldReaders;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: mark this as final?

.readBatch(total, vectors[i], structTypeInfo.getAllStructFieldTypeInfos().get(i));
structColumnVector.isRepeating = structColumnVector.isRepeating && vectors[i].isRepeating;

for (int j = 0; j < vectors[i].isNull.length; j++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we set structColumnVector.noNulls as well?

for (int j = 0; j < vectors[i].isNull.length; j++) {
structColumnVector.isNull[j] =
(i == 0) ? vectors[i].isNull[j] : structColumnVector.isNull[j] && vectors[i].isNull[j];
structColumnVector.noNulls = (i == 0) ? structColumnVector.isNull[j] :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. why this needs to be in the inner loop. Can you just do:
structColumnVector.noNulls = (i == 0) ? vectors[i].noNulls : structColumnVector.noNulls && vectors[i].noNulls;?

@asfgit asfgit closed this in 9a524ad Dec 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants