HIVE-15112: Implement Parquet vectorization reader for Struct type #116
winningsix wants to merge 8 commits into apache:master from
Conversation
@@ -0,0 +1,13 @@
package org.apache.hadoop.hive.ql.io.parquet.vector;

import java.io.IOException;

public interface VectorizedParquetColumnReader {
Add a comment on the readBatch method? Also, from the method signature it seems it should not be restricted to Parquet. How about VectorizedColumnReader?

import java.io.IOException;

public class VectorizedParquetMapReader implements VectorizedParquetColumnReader {
Is this really necessary? This is the same as VectorizedParquetColumnReader.

    int total,
    ColumnVector column,
    TypeInfo columnType) throws IOException {

    List<ColumnDescriptor> columns) {
  List<ColumnDescriptor> res = new ArrayList<>();
  for (ColumnDescriptor descriptor : columns) {
    if (type.getName().equals(descriptor.getPath()[depth])) {
What if the path length is smaller than depth? Will this crash?
This happens only when the schema is corrupted. Addressing this by adding a check before this if block.
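The guard the author describes (refusing to index descriptor.getPath()[depth] when the path is too short) can be sketched in isolation roughly as follows. DepthGuard and isDepthValid are illustrative names, not from the patch, and a plain String[] stands in for the path held by Parquet's ColumnDescriptor.

```java
// Hedged sketch of the schema-depth check discussed above. A corrupted
// schema can yield a column path shorter than the requested nesting
// depth; checking first avoids an ArrayIndexOutOfBoundsException.
public class DepthGuard {
  // Returns true only when it is safe to read path[depth].
  static boolean isDepthValid(String[] path, int depth) {
    return path != null && depth < path.length;
  }

  public static void main(String[] args) {
    String[] path = {"struct_field", "int_part"};
    System.out.println(isDepthValid(path, 1)); // depth within the path
    System.out.println(isDepthValid(path, 2)); // too deep: skip instead of crash
  }
}
```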
      fieldReaders.add(r);
    }
  }
  if (fieldReaders.size() > 0) {
What if fieldReaders.size() is not equal to fieldTypes.size()? Can this be handled?
  PrimitiveTypeInfo primitiveColumnType = (PrimitiveTypeInfo) columnType;
  readBatchForPrimitiveType(num, column, primitiveColumnType, rowId);
  break;
case LIST:
If this is a primitive column reader, why should it read complex types?
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hive.ql.io.parquet.vector;

case INTERVAL_DAY_TIME:
case TIMESTAMP:
default:
  throw new IOException("Unsupported");
Better to improve this message, e.g., include the specific type involved.
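A minimal sketch of the suggested improvement, assuming the reader has some string form of the column type at hand. The helper name unsupported and the parameter typeName are hypothetical, not names from the patch.

```java
import java.io.IOException;

// Hedged sketch: build the exception with the offending type in the
// message, so "Unsupported" becomes actionable in logs.
public class UnsupportedTypeError {
  static IOException unsupported(String typeName) {
    return new IOException("Unsupported type: " + typeName);
  }

  public static void main(String[] args) {
    System.out.println(unsupported("TIMESTAMP").getMessage());
  }
}
```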
  readFloats(num, (DoubleColumnVector) column, rowId);
  break;
case DECIMAL:
  readDecimal(num, (DecimalColumnVector) column, rowId);

  List<ColumnDescriptor> columns) throws ParquetRuntimeException {
List<ColumnDescriptor> res = new ArrayList<>();
for (ColumnDescriptor descriptor : columns) {
  if (depth > descriptor.getPath().length) {

  MessageType schema,
  boolean skipTimestampConversion) throws IOException {
return buildVectorizedParquetReader(typeInfo, type, pages, schema.getColumns(), skipTimestampConversion,
    0);
nit: can we put this on the same line above, for easier reading?
columnReaders[i] =
    new VectorizedColumnReader(columns.get(i), pages.getPageReader(columns.get(i)),
        skipTimestampConversion, types.get(i));
buildVectorizedParquetReader(columnTypesList.get(indexColumnsWanted.get(i)), types.get(i),
Is it possible that indexColumnsWanted could be empty?

public class VectorizedStructReader implements VectorizedColumnReader {

  List<VectorizedColumnReader> fieldReaders;

  fieldReaders.get(i)
      .readBatch(total, vectors[i], structTypeInfo.getAllStructFieldTypeInfos().get(i));
  structColumnVector.isRepeating = structColumnVector.isRepeating && vectors[i].isRepeating;
  for (int j = 0; j < vectors[i].isNull.length; j++) {
I think there's a difference between a null struct versus a struct with null fields. It seems this treats the two cases as the same. Do we need to differentiate them?
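To illustrate the concern, here is a standalone sketch using plain boolean arrays in place of Hive's ColumnVector.isNull flags. Deriving the struct's null flag from "all fields are null" conflates a NULL struct row with a present struct whose fields all happen to be null; the class and method names are illustrative only.

```java
// Hedged illustration: the per-row "all fields null" computation,
// which is what the patch's isNull merge effectively produces.
public class StructNullSemantics {
  // fieldIsNull[f][row] is the isNull flag of field f at the given row.
  static boolean allFieldsNull(boolean[][] fieldIsNull, int row) {
    for (boolean[] field : fieldIsNull) {
      if (!field[row]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // One row, two fields, both null. The row could still be a
    // non-null struct of nulls, e.g. {a: null, b: null}, rather than
    // a NULL struct; this heuristic cannot tell them apart.
    boolean[][] fields = {{true}, {true}};
    System.out.println(allFieldsNull(fields, 0));
  }
}
```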
    fs.delete(file, true);
  }
}

public class TestVectorizedColumnReader extends TestVectorizedColumnReaderBase {

  reader.close();
  }
}
|
|
I think we talked about testing reading decimal. Should we add it in this patch?
sunchao left a comment

Thanks @winningsix! This PR looks good to me. Just a few minor comments before checking it in.
import java.io.IOException;
import java.util.List;

public class VectorizedStructReader implements VectorizedColumnReader {
nit: rename this to VectorizedStructColumnReader, to be consistent with VectorizedColumnReader and VectorizedPrimitiveColumnReader?

public class VectorizedStructReader implements VectorizedColumnReader {

  private List<VectorizedColumnReader> fieldReaders;

  .readBatch(total, vectors[i], structTypeInfo.getAllStructFieldTypeInfos().get(i));
  structColumnVector.isRepeating = structColumnVector.isRepeating && vectors[i].isRepeating;

  for (int j = 0; j < vectors[i].isNull.length; j++) {
Should we set structColumnVector.noNulls as well?
for (int j = 0; j < vectors[i].isNull.length; j++) {
  structColumnVector.isNull[j] =
      (i == 0) ? vectors[i].isNull[j] : structColumnVector.isNull[j] && vectors[i].isNull[j];
  structColumnVector.noNulls = (i == 0) ? structColumnVector.isNull[j] :
Hmm... why does this need to be in the inner loop? Can you just do:
structColumnVector.noNulls = (i == 0) ? vectors[i].noNulls : structColumnVector.noNulls && vectors[i].noNulls;
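The suggested hoisting can be sketched in isolation like this. Plain booleans stand in for the per-field ColumnVector.noNulls flags, and the class and method names are illustrative; the point is that the merge is a per-field AND, so it belongs outside the per-row loop.

```java
// Hedged sketch: merge noNulls once per field vector instead of once
// per (field, row) pair. The (i == 0) seed in the review comment is
// equivalent to starting the accumulator at true.
public class NoNullsMerge {
  static boolean mergeNoNulls(boolean[] fieldNoNulls) {
    boolean noNulls = true;
    for (boolean fieldFlag : fieldNoNulls) {
      // Same as: noNulls = (i == 0) ? fieldNoNulls[0]
      //                             : noNulls && fieldNoNulls[i];
      noNulls = noNulls && fieldFlag;
    }
    return noNulls;
  }

  public static void main(String[] args) {
    System.out.println(mergeNoNulls(new boolean[]{true, true}));
    System.out.println(mergeNoNulls(new boolean[]{true, false}));
  }
}
```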
Refactor UT