[SPARK-36935][SQL] Extend ParquetSchemaConverter to compute Parquet repetition & definition level

### What changes were proposed in this pull request?

This PR includes the following:

1. Adds a new class `ParquetColumn`, a wrapper around a Spark `DataType` with additional Parquet column information: its max repetition and definition levels, its path from the root schema, its column descriptor if the node is a leaf, its child nodes if it is a non-leaf, etc. This is needed to support complex types in the vectorized path, where a column vector of complex type must be assembled using this information.
2. Extends `ParquetSchemaConverter` to convert from a Parquet `MessageType` to a `ParquetColumn`, mostly by piggy-backing on the existing logic. A new method `convertParquetColumn` is added for this purpose, and the existing `convert` is changed to simply call the former.

### Why are the changes needed?

In order to support complex types in the vectorized Parquet reader (see SPARK-34863 for more info), we need to capture Parquet-specific information, such as the max repetition/definition level, for Spark's `DataType`, so that we can assemble primitive column vectors into ones with complex type (e.g., struct, map, array).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Extended the test cases in `ParquetSchemaSuite`.

Closes #34199 from sunchao/SPARK-36935-column-io.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
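To make the "max repetition/definition level" terminology concrete, here is a minimal sketch of how those maxima can be derived by walking a schema tree. It assumes the usual Parquet rule that a `REPEATED` field raises both levels and an `OPTIONAL` field raises only the definition level. `SchemaNode` and `LevelSketch` are hypothetical stand-ins invented for illustration; they are not parquet-mr classes and not the code added by this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a Parquet schema node; NOT the parquet-mr API.
class SchemaNode {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    final String name;
    final Repetition repetition;
    final List<SchemaNode> children;

    SchemaNode(String name, Repetition repetition, SchemaNode... children) {
        this.name = name;
        this.repetition = repetition;
        this.children = Arrays.asList(children);
    }
}

class LevelSketch {
    // Walks the tree and records "path: rep=R def=D" for every leaf.
    static void collect(SchemaNode node, String prefix, int rep, int def, List<String> out) {
        if (node.repetition == SchemaNode.Repetition.REPEATED) {
            // A repeated field raises both the repetition and the definition level.
            rep++;
            def++;
        } else if (node.repetition == SchemaNode.Repetition.OPTIONAL) {
            // An optional field raises only the definition level.
            def++;
        }
        String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
        if (node.children.isEmpty()) {
            out.add(path + ": rep=" + rep + " def=" + def);
        } else {
            for (SchemaNode child : node.children) {
                collect(child, path, rep, def, out);
            }
        }
    }

    // Top-level fields start at level 0 because the message root carries no levels.
    static List<String> leafLevels(SchemaNode... topLevelFields) {
        List<String> out = new ArrayList<>();
        for (SchemaNode field : topLevelFields) {
            collect(field, "", 0, 0, out);
        }
        return out;
    }

    public static void main(String[] args) {
        // required int64 id;
        // optional group tags { repeated group list { optional binary element; } }
        SchemaNode id = new SchemaNode("id", SchemaNode.Repetition.REQUIRED);
        SchemaNode element = new SchemaNode("element", SchemaNode.Repetition.OPTIONAL);
        SchemaNode list = new SchemaNode("list", SchemaNode.Repetition.REPEATED, element);
        SchemaNode tags = new SchemaNode("tags", SchemaNode.Repetition.OPTIONAL, list);
        for (String line : leafLevels(id, tags)) {
            System.out.println(line);
        }
    }
}
```

For the schema built in `main`, the walk reports `id` at rep=0, def=0 and `tags.list.element` at rep=1, def=3. In the PR itself these values are not recomputed this way; they are read from Parquet's `ColumnIO` objects.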
1 parent 293c085, commit d246010
Showing 6 changed files with 1,461 additions and 86 deletions.
40 changes: 40 additions & 0 deletions
sql/core/src/main/java/org/apache/parquet/io/ColumnIOUtil.java

    @@ -0,0 +1,40 @@
    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *    http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    package org.apache.parquet.io;

    /**
     * This is a workaround since methods below are not public in {@link ColumnIO}.
     *
     * TODO(SPARK-36511): we should remove this once PARQUET-2050 and PARQUET-2083 are released with
     *   Parquet 1.13.
     */
    public class ColumnIOUtil {
      private ColumnIOUtil() {}

      public static int getDefinitionLevel(ColumnIO column) {
        return column.getDefinitionLevel();
      }

      public static int getRepetitionLevel(ColumnIO column) {
        return column.getRepetitionLevel();
      }

      public static String[] getFieldPath(ColumnIO column) {
        return column.getFieldPath();
      }
    }
55 changes: 55 additions & 0 deletions
...ore/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetColumn.scala

    @@ -0,0 +1,55 @@
    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *    http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    package org.apache.spark.sql.execution.datasources.parquet

    import org.apache.parquet.column.ColumnDescriptor
    import org.apache.parquet.io.ColumnIOUtil
    import org.apache.parquet.io.GroupColumnIO
    import org.apache.parquet.io.PrimitiveColumnIO
    import org.apache.parquet.schema.Type.Repetition

    import org.apache.spark.sql.types.DataType

    /**
     * Rich information for a Parquet column together with its SparkSQL type.
     */
    case class ParquetColumn(
        sparkType: DataType,
        descriptor: Option[ColumnDescriptor], // only set when this is a primitive column
        repetitionLevel: Int,
        definitionLevel: Int,
        required: Boolean,
        path: Seq[String],
        children: Seq[ParquetColumn]) {

      def isPrimitive: Boolean = descriptor.nonEmpty
    }

    object ParquetColumn {
      def apply(sparkType: DataType, io: PrimitiveColumnIO): ParquetColumn = {
        this(sparkType, Some(io.getColumnDescriptor), ColumnIOUtil.getRepetitionLevel(io),
          ColumnIOUtil.getDefinitionLevel(io), io.getType.isRepetition(Repetition.REQUIRED),
          ColumnIOUtil.getFieldPath(io), Seq.empty)
      }

      def apply(sparkType: DataType, io: GroupColumnIO, children: Seq[ParquetColumn]): ParquetColumn = {
        this(sparkType, None, ColumnIOUtil.getRepetitionLevel(io),
          ColumnIOUtil.getDefinitionLevel(io), io.getType.isRepetition(Repetition.REQUIRED),
          ColumnIOUtil.getFieldPath(io), children)
      }
    }
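The shape of `ParquetColumn` (a tree whose leaves carry a descriptor and whose groups carry children) can be illustrated with a small standalone sketch. The Java below is a hypothetical mirror written only for illustration: `ColumnNode`, `leaf`, `group`, and `leaves` are invented names, the Spark type and the column descriptor are reduced to plain strings, and the levels are supplied by hand rather than read from `ColumnIO`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical Java mirror of the ParquetColumn case class, for illustration only.
class ColumnNode {
    final String sparkType;          // stand-in for Spark's DataType
    final String descriptor;         // stand-in for ColumnDescriptor; null for group columns
    final int repetitionLevel;
    final int definitionLevel;
    final boolean required;
    final List<String> path;
    final List<ColumnNode> children;

    private ColumnNode(String sparkType, String descriptor, int repetitionLevel,
                       int definitionLevel, boolean required,
                       List<String> path, List<ColumnNode> children) {
        this.sparkType = sparkType;
        this.descriptor = descriptor;
        this.repetitionLevel = repetitionLevel;
        this.definitionLevel = definitionLevel;
        this.required = required;
        this.path = path;
        this.children = children;
    }

    // Mirrors the PrimitiveColumnIO overload: a descriptor is set, no children.
    static ColumnNode leaf(String sparkType, String descriptor, int rep, int def,
                           boolean required, String... path) {
        return new ColumnNode(sparkType, descriptor, rep, def, required,
            Arrays.asList(path), Collections.<ColumnNode>emptyList());
    }

    // Mirrors the GroupColumnIO overload: no descriptor, explicit children.
    static ColumnNode group(String sparkType, int rep, int def, boolean required,
                            List<String> path, ColumnNode... children) {
        return new ColumnNode(sparkType, null, rep, def, required,
            path, Arrays.asList(children));
    }

    boolean isPrimitive() {
        return descriptor != null;
    }

    // Depth-first list of the primitive columns under this node.
    List<ColumnNode> leaves() {
        List<ColumnNode> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(ColumnNode node, List<ColumnNode> out) {
        if (node.isPrimitive()) {
            out.add(node);
        } else {
            for (ColumnNode child : node.children) {
                collect(child, out);
            }
        }
    }
}

class ColumnTreeSketch {
    public static void main(String[] args) {
        // required group point { optional int32 x; required int32 y; }
        ColumnNode x = ColumnNode.leaf("int", "point.x", 0, 1, false, "point", "x");
        ColumnNode y = ColumnNode.leaf("int", "point.y", 0, 0, true, "point", "y");
        ColumnNode point = ColumnNode.group("struct<x:int,y:int>", 0, 0, true,
            Arrays.asList("point"), x, y);
        for (ColumnNode leaf : point.leaves()) {
            System.out.println(leaf.path + " def=" + leaf.definitionLevel);
        }
    }
}
```

A reader that assembles a struct vector would need roughly this view: the group's own levels plus the leaves beneath it. How the vectorized reader actually consumes `ParquetColumn` is outside this commit and is not shown here.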