
NullPointerException after deleting old partition column #10626

Open · mgmarino opened this issue Jul 3, 2024 · 8 comments
Labels: bug (Something isn't working)

mgmarino commented Jul 3, 2024

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

We have an Iceberg table whose partitioning we changed from an identity partition to hidden partitioning.

The partition specs are defined in the metadata JSON file:

  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "day",
      "transform" : "identity",
      "source-id" : 6,
      "field-id" : 1000
    } ]
  }, {
    "spec-id" : 1,
    "fields" : [ {
      "name" : "arrival_ts_day",
      "transform" : "day",
      "source-id" : 5,
      "field-id" : 1001
    } ]
  } ],
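
For context, the overall sequence (the partition evolution, the cleanup, and the failing read described below) corresponds roughly to the following sketch. This is only an illustration: the catalog/table name and the arrival_ts source column are placeholders inferred from the specs above, and the exact DDL we originally ran may have differed.

import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of the sequence (not our exact code).
public class DropOldPartitionColumnRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // spec 0 -> spec 1: replace the identity partition with a hidden day() partition
    // (requires the Iceberg Spark SQL extensions).
    spark.sql("ALTER TABLE glue.db.events REPLACE PARTITION FIELD day"
        + " WITH days(arrival_ts) AS arrival_ts_day");

    // ... later: full rewrite via rewrite_data_files, old snapshots expired ...

    // Dropping the old identity source column succeeds ...
    spark.sql("ALTER TABLE glue.db.events DROP COLUMN day");

    // ... but any subsequent read then fails with "Type cannot be null".
    spark.table("glue.db.events.partitions").show();
  }
}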

We did this evolution quite some time ago (unfortunately I can't remember which version of Iceberg we were using when we changed the partitioning), and we are now trying to clean up the table by removing the old day column. Running a DROP COLUMN in Spark (3.5.1, using Iceberg 1.5.2) succeeds, but a subsequent read of the table, or e.g. of the partitions metadata table, results in:

Caused by: java.lang.NullPointerException: Type cannot be null
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
	at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:447)
	at org.apache.iceberg.types.Types$NestedField.optional(Types.java:416)
	at org.apache.iceberg.PartitionSpec.partitionType(PartitionSpec.java:132)
	at org.apache.iceberg.Partitioning.buildPartitionProjectionType(Partitioning.java:274)
	at org.apache.iceberg.Partitioning.partitionType(Partitioning.java:242)
	at org.apache.iceberg.PartitionsTable.partitions(PartitionsTable.java:167)
	at org.apache.iceberg.PartitionsTable.task(PartitionsTable.java:122)
	at org.apache.iceberg.PartitionsTable.access$1100(PartitionsTable.java:35)
	at org.apache.iceberg.PartitionsTable$PartitionsScan.lambda$new$0(PartitionsTable.java:248)
	at org.apache.iceberg.StaticTableScan.doPlanFiles(StaticTableScan.java:53)
	at org.apache.iceberg.SnapshotScan.planFiles(SnapshotScan.java:139)
	at org.apache.iceberg.BatchScanAdapter.planFiles(BatchScanAdapter.java:119)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.tasks(SparkPartitioningAwareScan.java:174)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.taskGroups(SparkPartitioningAwareScan.java:202)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.outputPartitioning(SparkPartitioningAwareScan.java:104)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:44)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:42)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248)
	at org.apache.spark.sql.catalyst.plans.logical.Project.mapChildren(basicLogicalOperators.scala:69)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:517)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248)
	at org.apache.spark.sql.catalyst.plans.logical.LocalLimit.mapChildren(basicLogicalOperators.scala:1563)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:517)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248)
	at org.apache.spark.sql.catalyst.plans.logical.GlobalLimit.mapChildren(basicLogicalOperators.scala:1542)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:517)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:488)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.partitioning(V2ScanPartitioningAndOrdering.scala:42)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.$anonfun$apply$1(V2ScanPartitioningAndOrdering.scala:35)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.$anonfun$apply$3(V2ScanPartitioningAndOrdering.scala:38)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.apply(V2ScanPartitioningAndOrdering.scala:37)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.apply(V2ScanPartitioningAndOrdering.scala:33)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:222)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:219)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:211)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:211)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:182)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:182)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:143)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
	... 32 more

Reads fail like this in Spark, and writes/commits from Flink (1.18.1, also using Iceberg 1.5.2) fail after this change as well. There the stack trace looks like:

java.lang.NullPointerException: Type cannot be null
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
	at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:448)
	at org.apache.iceberg.types.Types$NestedField.optional(Types.java:417)
	at org.apache.iceberg.PartitionSpec.partitionType(PartitionSpec.java:132)
	at org.apache.iceberg.util.PartitionSet.lambda$new$0(PartitionSet.java:46)
	at org.apache.iceberg.relocated.com.google.common.collect.RegularImmutableMap.forEach(RegularImmutableMap.java:297)
	at org.apache.iceberg.util.PartitionSet.<init>(PartitionSet.java:46)
	at org.apache.iceberg.util.PartitionSet.create(PartitionSet.java:38)
	at org.apache.iceberg.ManifestFilterManager.<init>(ManifestFilterManager.java:94)
	at org.apache.iceberg.MergingSnapshotProducer$DataFileFilterManager.<init>(MergingSnapshotProducer.java:1028)
	at org.apache.iceberg.MergingSnapshotProducer$DataFileFilterManager.<init>(MergingSnapshotProducer.java:1026)
	at org.apache.iceberg.MergingSnapshotProducer.<init>(MergingSnapshotProducer.java:118)
	at org.apache.iceberg.MergeAppend.<init>(MergeAppend.java:32)
	at org.apache.iceberg.BaseTable.newAppend(BaseTable.java:180)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitDeltaTxn(IcebergFilesCommitter.java:360)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitPendingResult(IcebergFilesCommitter.java:298)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.commitUpToCheckpoint(IcebergFilesCommitter.java:280)
	at org.apache.iceberg.flink.sink.IcebergFilesCommitter.initializeState(IcebergFilesCommitter.java:198)
	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:122)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:274)
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:753)
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:728)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:693)
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:955)
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:924)
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:748)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:564)
	at java.base/java.lang.Thread.run(Thread.java:829)

We are using the AWS Glue Catalog to store information about the table. Here are the table properties currently set:

+------------------------------------------+-------------------+
|key                                       |value              |
+------------------------------------------+-------------------+
|connector                                 |none               |
|current-snapshot-id                       |2617120118159963811|
|format                                    |iceberg/parquet    |
|format-version                            |2                  |
|history.expire.max-snapshot-age-ms        |6000000            |
|write.metadata.delete-after-commit.enabled|true               |
|write.metadata.previous-versions-max      |2880               |
+------------------------------------------+-------------------+

The only way for us to recover was to force the table to point to the metadata file right before the change.
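
For anyone else who hits this: one way to get back to a working table is to re-point at the last-good metadata file. The following is only a sketch (not necessarily the exact steps we took); the catalog/table names and the metadata path are placeholders.

import org.apache.spark.sql.SparkSession;

// Hypothetical recovery sketch: register the last-good metadata file under a new name
// using the Iceberg register_table procedure. Alternatively, the Glue table's
// "metadata_location" parameter can be updated in place to point at that file.
public class RepointTableMetadata {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    spark.sql("CALL glue.system.register_table("
        + "table => 'db.events_recovered', "
        + "metadata_file => 's3://<bucket>/<table>/metadata/<last-good>.metadata.json')");
  }
}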

I can provide the two metadata files if that's helpful, but I would rather do that privately if possible.

This seems quite similar to #7386; for reference, the table was initially written using Iceberg 1.2.1.

Please let me know if I can provide any other information!

mgmarino added the bug label on Jul 3, 2024
dramaticlly (Contributor) commented Jul 4, 2024

Based on the stack trace, it looks like the partitions table needs to evaluate all historical partition specs to build the partition value (per #7551), but the column has already been dropped from the current schema. I am not sure whether an old partition spec references the old schema or the current table schema when looking up the source field id; if it's the current schema, I think that would be the cause of the NPE. FYI @szehon-ho for better insight.
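
Roughly, the code path in the stack trace does something like the following (a paraphrased sketch of PartitionSpec.partitionType, not the exact Iceberg source):

import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

class PartitionTypeSketch {
  // The partition struct is built by resolving each partition field's source column
  // against the schema the spec is currently bound to.
  static Types.StructType partitionStructOf(PartitionSpec spec, Schema schema) {
    List<Types.NestedField> structFields = new ArrayList<>();
    for (PartitionField field : spec.fields()) {
      // For old spec 0 this looks up source-id 6 (the dropped "day" column),
      // so findType(...) returns null ...
      Type sourceType = schema.findType(field.sourceId());
      // ... the identity transform passes the (null) source type through ...
      Type resultType = field.transform().getResultType(sourceType);
      // ... and NestedField.optional(...) then throws "Type cannot be null".
      structFields.add(Types.NestedField.optional(field.fieldId(), field.name(), resultType));
    }
    return Types.StructType.of(structFields);
  }
}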

lurnagao-dahua (Contributor) commented

Hi, same issue as in #10234.

mgmarino (Author) commented Jul 4, 2024

@lurnagao-dahua thanks for linking that; I did not find it when searching. I see a few things that are possibly different here (though of course the underlying cause could be the same):

  • No manifests reference the old partition spec id, i.e. sc.table("catalog.db.table.all_manifests").filter("partition_spec_id != 1") returns no rows. sc.table("catalog.db.table.partitions").filter("spec_id != 1") is also empty
  • After the partition evolution, we did a full rewrite of the data using rewrite_data_files, and any snapshots that referenced the old partitions have been expired/removed for some time.

Fokko (Contributor) commented Oct 28, 2024

@mgmarino Can you check if this is still the case with 1.6? This is a known issue, and there has been some work around it.

mgmarino (Author) commented

@Fokko Just tried with Spark 3.5.3 and Iceberg 1.6.1; still the same null pointer error following a DROP COLUMN. The only recovery was to go back to the previous metadata file (as before).

I'm happy to try and dig into this a little more, just let me know what I can do.

Fokko (Contributor) commented Oct 30, 2024

Thanks! If you could share the stack trace from 1.6.1, that would be awesome; I hope you still have it at hand. Let's see if we can get this fixed.

mgmarino (Author) commented

Sure, no problem. I can also provide the metadata files (preferably via email, etc.) if that would help:

[INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace.
	at org.apache.spark.SparkException$.internalError(SparkException.scala:107)
	at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:536)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:548)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:219)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:218)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:91)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
	at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:4352)
	at org.apache.spark.sql.Dataset.select(Dataset.scala:1542)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:280)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:315)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NullPointerException: Type cannot be null
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:924)
	at org.apache.iceberg.types.Types$NestedField.<init>(Types.java:448)
	at org.apache.iceberg.types.Types$NestedField.optional(Types.java:417)
	at org.apache.iceberg.PartitionSpec.partitionType(PartitionSpec.java:134)
	at org.apache.iceberg.Partitioning.buildPartitionProjectionType(Partitioning.java:274)
	at org.apache.iceberg.Partitioning.partitionType(Partitioning.java:242)
	at org.apache.iceberg.spark.source.SparkTable.metadataColumns(SparkTable.java:258)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.metadataOutput$lzycompute(DataSourceV2Relation.scala:59)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.metadataOutput(DataSourceV2Relation.scala:56)
	at org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias.metadataOutput(basicLogicalOperators.scala:1692)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3(Analyzer.scala:1051)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$3$adapted(Analyzer.scala:1051)
	at scala.collection.Iterator.exists(Iterator.scala:969)
	at scala.collection.Iterator.exists$(Iterator.scala:967)
	at scala.collection.AbstractIterator.exists(Iterator.scala:1431)
	at scala.collection.IterableLike.exists(IterableLike.scala:79)
	at scala.collection.IterableLike.exists$(IterableLike.scala:78)
	at scala.collection.AbstractIterable.exists(Iterable.scala:56)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$2(Analyzer.scala:1051)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$2$adapted(Analyzer.scala:1046)
	at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:223)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1(TreeNode.scala:226)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1$adapted(TreeNode.scala:226)
	at scala.collection.Iterator.exists(Iterator.scala:969)
	at scala.collection.Iterator.exists$(Iterator.scala:967)
	at scala.collection.AbstractIterator.exists(Iterator.scala:1431)
	at scala.collection.IterableLike.exists(IterableLike.scala:79)
	at scala.collection.IterableLike.exists$(IterableLike.scala:78)
	at scala.collection.AbstractIterable.exists(Iterable.scala:56)
	at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:226)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1(TreeNode.scala:226)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1$adapted(TreeNode.scala:226)
	at scala.collection.Iterator.exists(Iterator.scala:969)
	at scala.collection.Iterator.exists$(Iterator.scala:967)
	at scala.collection.AbstractIterator.exists(Iterator.scala:1431)
	at scala.collection.IterableLike.exists(IterableLike.scala:79)
	at scala.collection.IterableLike.exists$(IterableLike.scala:78)
	at scala.collection.AbstractIterable.exists(Iterable.scala:56)
	at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:226)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$1(Analyzer.scala:1046)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.$anonfun$hasMetadataCol$1$adapted(Analyzer.scala:1046)
	at scala.collection.LinearSeqOptimized.exists(LinearSeqOptimized.scala:95)
	at scala.collection.LinearSeqOptimized.exists$(LinearSeqOptimized.scala:92)
	at scala.collection.immutable.Stream.exists(Stream.scala:204)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.org$apache$spark$sql$catalyst$analysis$Analyzer$AddMetadataColumns$$hasMetadataCol(Analyzer.scala:1046)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$$anonfun$apply$13.applyOrElse(Analyzer.scala:1016)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$$anonfun$apply$13.applyOrElse(Analyzer.scala:1013)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:32)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.apply(Analyzer.scala:1013)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns$.apply(Analyzer.scala:1009)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:222)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:91)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:219)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:211)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:211)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:240)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:236)
	at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:187)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:236)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:202)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:182)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:182)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:223)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:222)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:77)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:219)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)

Fokko (Contributor) commented Oct 30, 2024

@mgmarino If you could try to come up with a minimal example, like

public void testColumnDropWithPartitionSpecEvolution() {
  PartitionSpec spec = PartitionSpec.builderFor(schema).identity("id").build();
  TestTables.TestTable table = TestTables.create(tableDir, "test", schema, spec, formatVersion);
  assertThat(table.spec()).isEqualTo(spec);

  TableMetadata base = TestTables.readMetadata("test");
  PartitionSpec newSpec =
      PartitionSpec.builderFor(table.schema()).identity("data").withSpecId(1).build();
  table.ops().commit(base, base.updatePartitionSpec(newSpec));

  int initialColSize = table.schema().columns().size();
  table.updateSchema().deleteColumn("id").commit();
  final Schema expectedSchema = new Schema(required(2, "data", Types.StringType.get()));

  assertThat(table.spec()).isEqualTo(newSpec);
  assertThat(table.specs())
      .containsExactly(entry(spec.specId(), spec), entry(newSpec.specId(), newSpec))
      .doesNotContainKey(Integer.MAX_VALUE);
  assertThat(table.schema().asStruct()).isEqualTo(expectedSchema.asStruct());
}
that triggers the same issue, that would be awesome. The most important part is to reproduce it, so we can add a test to make sure that this error doesn't reappear in the future.
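
The sketch above only exercises the metadata updates themselves; to actually hit the reported NPE, the unified partition type probably also has to be evaluated after the drop, something like the following on top of it (untested; assumes the same fixtures plus AssertJ's assertThatThrownBy and an org.apache.iceberg.Partitioning import):

// After the deleteColumn("id") commit above, building the unified partition type
// across all historical specs should reproduce the NPE, since spec 0 still
// references the dropped source column.
assertThatThrownBy(() -> Partitioning.partitionType(table))
    .isInstanceOf(NullPointerException.class)
    .hasMessageContaining("Type cannot be null");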
