
[SPARK-48177][BUILD] Upgrade Apache Parquet to 1.14.0 #46447

Open
wants to merge 1 commit into master
Conversation

Fokko
Contributor

@Fokko Fokko commented May 7, 2024

What changes were proposed in this pull request?
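A minimal sketch of the change implied by the title, assuming Spark's root pom.xml pins the Parquet release through a parquet.version property and that the previous value was 1.13.1:

-    <parquet.version>1.13.1</parquet.version>
+    <parquet.version>1.14.0</parquet.version>

The individual parquet-* artifacts would then pick up the new release through the shared property.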

Why are the changes needed?

Fixes quite a few bugs on the Parquet side: https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140

Does this PR introduce any user-facing change?

No

How was this patch tested?

Using the existing unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the BUILD label May 7, 2024
Member

@dongjoon-hyun dongjoon-hyun left a comment


Yay, finally.

Please run the following and attach the updated dependency file, @Fokko.

dev/test-dependencies.sh --replace-manifest

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-48177][BUILD]: Bump Apache Parquet to 1.14.0" to "[SPARK-48177][BUILD] Upgrade Apache Parquet to 1.14.0" on May 7, 2024
@dongjoon-hyun
Member

cc @cloud-fan, @HyukjinKwon, @mridulm, @sunchao, @yaooqinn, @LuciferYang, @steveloughran, @viirya, @huaxin, @parthchandra, too.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Oh, it seems that there are many unit test failures.

[info] *** 189 TESTS FAILED ***
[error] Failed: Total 1526, Failed 189, Errors 0, Passed 1337, Ignored 597
[error] Failed tests:
[error] 	org.apache.spark.sql.hive.execution.SQLQuerySuite
[error] 	org.apache.spark.sql.hive.execution.HiveResolutionSuite
[error] 	org.apache.spark.sql.hive.execution.HiveDDLSuite
[error] 	org.apache.spark.sql.hive.execution.HiveQuerySuite
[error] 	org.apache.spark.sql.hive.execution.SQLQuerySuiteAE
[error] 	org.apache.spark.sql.hive.execution.HiveSQLViewSuite
[error] 	org.apache.spark.sql.hive.execution.HashUDAQuerySuite
[error] 	org.apache.spark.sql.hive.execution.PruneHiveTablePartitionsSuite
[error] 	org.apache.spark.sql.hive.execution.HiveUDAFSuite
[error] 	org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite
[error] 	org.apache.spark.sql.hive.execution.HiveTableScanSuite
[error] 	org.apache.spark.sql.hive.execution.HashAggregationQueryWithControlledFallbackSuite
[error] 	org.apache.spark.sql.hive.execution.HiveCommandSuite
[error] 	org.apache.spark.sql.hive.execution.HashUDAQueryWithControlledFallbackSuite
[error] 	org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
[error] 	org.apache.spark.sql.hive.execution.HiveUDFSuite
[error] 	org.apache.spark.sql.hive.HiveSparkSubmitSuite
[error] 	org.apache.spark.sql.hive.execution.HashAggregationQuerySuite
[error] (hive / Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 1448 s (24:08), completed May 7, 2024, 9:07:49 PM

For example,

- SPARK-6851: Self-joined converted parquet tables *** FAILED *** (4 seconds, 473 milliseconds)
[info]   java.util.concurrent.ExecutionException: org.apache.spark.SparkException: [FAILED_READ_FILE.NO_HINT] Encountered error while reading file file:///home/runner/work/spark/spark/target/tmp/warehouse-75fc0262-e914-40da-98bf-ad2460270fb5/orders/state=CA/month=20151/part-00000-d46019ae-951c-4974-96da-2b38ade7b49e.c000.snappy.parquet. SQLSTATE: KD001

@dongjoon-hyun
Member

dongjoon-hyun commented May 7, 2024

Oh, it seems that some files from the target folder were added by mistake.

FYI, this PR is supposed to have two files: pom.xml and dev/deps/spark-deps-hadoop-3-hive-2.3.
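For illustration, regenerating the manifest should swap the Parquet entries along these lines (a sketch: the artifact/version//file line format, the 1.13.1 starting version, and the two modules shown are assumptions):

-parquet-column/1.13.1//parquet-column-1.13.1.jar
+parquet-column/1.14.0//parquet-column-1.14.0.jar
-parquet-hadoop/1.13.1//parquet-hadoop-1.13.1.jar
+parquet-hadoop/1.14.0//parquet-hadoop-1.14.0.jar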

@Fokko
Contributor Author

Fokko commented May 7, 2024

Thanks for pointing that out, @dongjoon-hyun. I've fixed it right away 👍

@Fokko
Contributor Author

Fokko commented May 7, 2024

I have to look into the tests 👀

@rshkv
Contributor

rshkv commented May 21, 2024

I think the toPrettyJson errors seen here are reported in PARQUET-2468 and are being addressed in apache/parquet-java#1349. We might have to wait for 1.14.1.

Cause: java.lang.RuntimeException: shaded.parquet.com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer found for class org.apache.parquet.schema.LogicalTypeAnnotation$StringLogicalTypeAnnotation and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: org.apache.parquet.hadoop.metadata.ParquetMetadata["fileMetaData"]->org.apache.parquet.hadoop.metadata.FileMetaData["schema"]->org.apache.parquet.schema.MessageType["fields"]->java.util.ArrayList[1]->org.apache.parquet.schema.PrimitiveType["logicalTypeAnnotation"])
	at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:68)
	at org.apache.parquet.hadoop.metadata.ParquetMetadata.toPrettyJSON(ParquetMetadata.java:48)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1592)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:629)
Caused by: shaded.parquet.com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Java 8 optional type `java.util.Optional<java.lang.Long>` not supported by default: add Module "shaded.parquet.com.fasterxml.jackson.datatype:jackson-datatype-jdk8" to enable handling (through reference chain: org.apache.parquet.hadoop.metadata.ParquetMetadata["blocks"]->java.util.ArrayList[0]->org.apache.parquet.hadoop.metadata.BlockMetaData["columns"]->java.util.Collections$UnmodifiableRandomAccessList[0]->org.apache.parquet.hadoop.metadata.IntColumnChunkMetaData["sizeStatistics"]->org.apache.parquet.column.statistics.SizeStatistics["unencodedByteArrayDataBytes"])
	at shaded.parquet.com.fasterxml.jackson.databind.exc.InvalidDefinitionException.from(InvalidDefinitionException.java:77)
	...
	at shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:1114)
	at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:62)
	at org.apache.parquet.hadoop.metadata.ParquetMetadata.toPrettyJSON(ParquetMetadata.java:48)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1592)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:629)
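To make the second failure concrete (only an illustration, not necessarily how apache/parquet-java#1349 fixes it): the trace asks for Jackson's jdk8 datatype module to be registered on the ObjectMapper so that java.util.Optional values, such as the new SizeStatistics fields, can be serialized. A minimal sketch using the plain (unshaded) Jackson packages and a hypothetical bean:

import java.util.Optional;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jdk8.Jdk8Module;

public class OptionalSerializationSketch {
    // Hypothetical bean mirroring a metadata field of type Optional<Long>.
    static class Stats {
        public Optional<Long> unencodedByteArrayDataBytes = Optional.of(42L);
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Without this registration, writeValueAsString throws the same
        // InvalidDefinitionException: "Java 8 optional type ... not supported by default".
        mapper.registerModule(new Jdk8Module());
        System.out.println(mapper.writeValueAsString(new Stats()));
    }
}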

@Fokko
Contributor Author

Fokko commented May 21, 2024

Thanks for digging into this, @rshkv. Let's follow up on the Parquet side.

@dongjoon-hyun
Member

Thank you, @rshkv and @Fokko.
