HIVE-15112: Implement Parquet vectorization reader for Struct type #116
winningsix wants to merge 8 commits into apache:master from
Conversation
@@ -0,0 +1,13 @@
package org.apache.hadoop.hive.ql.io.parquet.vector;

import java.io.IOException;

public interface VectorizedParquetColumnReader {
Add a comment on the readBatch method? Also, from the method signature it seems it should not be restricted to Parquet. How about VectorizedColumnReader?

import java.io.IOException;

public class VectorizedParquetMapReader implements VectorizedParquetColumnReader {
Is this really necessary? This is the same as VectorizedParquetColumnReader.

    int total,
    ColumnVector column,
    TypeInfo columnType) throws IOException {

    List<ColumnDescriptor> columns) {
  List<ColumnDescriptor> res = new ArrayList<>();
  for (ColumnDescriptor descriptor : columns) {
    if (type.getName().equals(descriptor.getPath()[depth])) {
What if the path length is smaller than depth? Will this crash?
This happens only when the schema is corrupted. Addressing this by adding a check before this if block.
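The guard the author describes (refusing to index descriptor.getPath()[depth] when the path is too short) can be sketched in isolation roughly as follows. DepthGuard and isDepthValid are illustrative names, not from the patch, and a plain String[] stands in for the path held by Parquet's ColumnDescriptor.

```java
// Hedged sketch of the schema-depth check discussed above. A corrupted
// schema can yield a column path shorter than the requested nesting
// depth; checking first avoids an ArrayIndexOutOfBoundsException.
public class DepthGuard {
  // Returns true only when it is safe to read path[depth].
  static boolean isDepthValid(String[] path, int depth) {
    return path != null && depth < path.length;
  }

  public static void main(String[] args) {
    String[] path = {"struct_field", "int_part"};
    System.out.println(isDepthValid(path, 1)); // depth within the path
    System.out.println(isDepthValid(path, 2)); // too deep: skip instead of crash
  }
}
```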
      fieldReaders.add(r);
    }
  }
  if (fieldReaders.size() > 0) {
What if fieldReaders.size() is not equal to fieldTypes.size()? Can this be handled?
  PrimitiveTypeInfo primitiveColumnType = (PrimitiveTypeInfo) columnType;
  readBatchForPrimitiveType(num, column, primitiveColumnType, rowId);
  break;
case LIST:
If this is a primitive column reader, why should it read complex types?
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hive.ql.io.parquet.vector;

case INTERVAL_DAY_TIME:
case TIMESTAMP:
default:
  throw new IOException("Unsupported");
Better to improve this message, e.g., include the specific type involved.
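A minimal sketch of the suggested improvement, assuming the reader has some string form of the column type at hand. The helper name unsupported and the parameter typeName are hypothetical, not names from the patch.

```java
import java.io.IOException;

// Hedged sketch: build the exception with the offending type in the
// message, so "Unsupported" becomes actionable in logs.
public class UnsupportedTypeError {
  static IOException unsupported(String typeName) {
    return new IOException("Unsupported type: " + typeName);
  }

  public static void main(String[] args) {
    System.out.println(unsupported("TIMESTAMP").getMessage());
  }
}
```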
  readFloats(num, (DoubleColumnVector) column, rowId);
  break;
case DECIMAL:
  readDecimal(num, (DecimalColumnVector) column, rowId);

  List<ColumnDescriptor> columns) throws ParquetRuntimeException {
List<ColumnDescriptor> res = new ArrayList<>();
for (ColumnDescriptor descriptor : columns) {
  if (depth > descriptor.getPath().length) {

  MessageType schema,
  boolean skipTimestampConversion) throws IOException {
return buildVectorizedParquetReader(typeInfo, type, pages, schema.getColumns(), skipTimestampConversion,
    0);
nit: can we put this on the same line above, for easier reading?
columnReaders[i] =
    new VectorizedColumnReader(columns.get(i), pages.getPageReader(columns.get(i)),
        skipTimestampConversion, types.get(i));
buildVectorizedParquetReader(columnTypesList.get(indexColumnsWanted.get(i)), types.get(i),
Is it possible that indexColumnsWanted could be empty?

public class VectorizedStructReader implements VectorizedColumnReader {

  List<VectorizedColumnReader> fieldReaders;

  fieldReaders.get(i)
      .readBatch(total, vectors[i], structTypeInfo.getAllStructFieldTypeInfos().get(i));
  structColumnVector.isRepeating = structColumnVector.isRepeating && vectors[i].isRepeating;
  for (int j = 0; j < vectors[i].isNull.length; j++) {
I think there's a difference between a null struct versus a struct with null fields. It seems this treats the two cases as the same. Do we need to differentiate them?
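To illustrate the concern, here is a standalone sketch using plain boolean arrays in place of Hive's ColumnVector.isNull flags. Deriving the struct's null flag from "all fields are null" conflates a NULL struct row with a present struct whose fields all happen to be null; the class and method names are illustrative only.

```java
// Hedged illustration: the per-row "all fields null" computation,
// which is what the patch's isNull merge effectively produces.
public class StructNullSemantics {
  // fieldIsNull[f][row] is the isNull flag of field f at the given row.
  static boolean allFieldsNull(boolean[][] fieldIsNull, int row) {
    for (boolean[] field : fieldIsNull) {
      if (!field[row]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // One row, two fields, both null. The row could still be a
    // non-null struct of nulls, e.g. {a: null, b: null}, rather than
    // a NULL struct; this heuristic cannot tell them apart.
    boolean[][] fields = {{true}, {true}};
    System.out.println(allFieldsNull(fields, 0));
  }
}
```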
    fs.delete(file, true);
  }
}

public class TestVectorizedColumnReader extends TestVectorizedColumnReaderBase {

  reader.close();
  }
}
|
|
I think we talked about testing reading decimal. Should we add it in this patch?
sunchao left a comment

Thanks @winningsix! This PR looks good to me. Just a few minor comments before checking it in.
import java.io.IOException;
import java.util.List;

public class VectorizedStructReader implements VectorizedColumnReader {
nit: rename this to VectorizedStructColumnReader, to be consistent with VectorizedColumnReader and VectorizedPrimitiveColumnReader?

public class VectorizedStructReader implements VectorizedColumnReader {

  private List<VectorizedColumnReader> fieldReaders;

  .readBatch(total, vectors[i], structTypeInfo.getAllStructFieldTypeInfos().get(i));
  structColumnVector.isRepeating = structColumnVector.isRepeating && vectors[i].isRepeating;

  for (int j = 0; j < vectors[i].isNull.length; j++) {
Should we set structColumnVector.noNulls as well?
for (int j = 0; j < vectors[i].isNull.length; j++) {
  structColumnVector.isNull[j] =
      (i == 0) ? vectors[i].isNull[j] : structColumnVector.isNull[j] && vectors[i].isNull[j];
  structColumnVector.noNulls = (i == 0) ? structColumnVector.isNull[j] :
Hmm... why does this need to be in the inner loop? Can you just do:
structColumnVector.noNulls = (i == 0) ? vectors[i].noNulls : structColumnVector.noNulls && vectors[i].noNulls;
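The suggested hoisting can be sketched in isolation like this. Plain booleans stand in for the per-field ColumnVector.noNulls flags, and the class and method names are illustrative; the point is that the merge is a per-field AND, so it belongs outside the per-row loop.

```java
// Hedged sketch: merge noNulls once per field vector instead of once
// per (field, row) pair. The (i == 0) seed in the review comment is
// equivalent to starting the accumulator at true.
public class NoNullsMerge {
  static boolean mergeNoNulls(boolean[] fieldNoNulls) {
    boolean noNulls = true;
    for (boolean fieldFlag : fieldNoNulls) {
      // Same as: noNulls = (i == 0) ? fieldNoNulls[0]
      //                             : noNulls && fieldNoNulls[i];
      noNulls = noNulls && fieldFlag;
    }
    return noNulls;
  }

  public static void main(String[] args) {
    System.out.println(mergeNoNulls(new boolean[]{true, true}));
    System.out.println(mergeNoNulls(new boolean[]{true, false}));
  }
}
```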
Refactor UT