[spark] fix drop partition when partition value is null or empty string #6662
Conversation
cc @JingsongLi @Zouxxyy @YannByron Can you take a look when you have time? Thanks!
JingsongLi left a comment:
You mean HiveCatalog? Can we fix HiveCatalog.dropPartitions to deal with default values?
@JingsongLi Thanks for looking at this. I think this issue occurs for all catalogs. The root cause lies in Spark's current partition-dropping logic, which incorrectly transforms partition values using InternalRowPartitionComputer, not in the catalog implementation itself.

Here's how the issue unfolds: when executing ALTER TABLE T DROP PARTITION(pt=''), PaimonPartitionManagement#toPaimonPartitions uses InternalRowPartitionComputer to convert the Spark Row into a partition spec of type Map<String, String>. During this conversion, the empty string ('') is replaced with the default partition value, resulting in (pt='DEFAULT_PARTITION'). This spec is then passed to FileStoreCommitImpl#dropPartitions to delete the partition. Inside dropPartitions, the spec (pt='DEFAULT_PARTITION') is converted back into a Paimon internal row (see paimon-common/src/main/java/org/apache/paimon/utils/InternalRowPartitionComputer.java, lines 135 to 153 in 4a7cdf4).
As a result, data with pt = '' is not deleted; instead, data with pt = null gets deleted (if it exists), leading to incorrect behavior. This also happens in DELETE FROM.
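To make the lossy conversion concrete, here is a minimal, self-contained sketch of that substitution step. It is not Paimon's actual code: the class, method, and driver are hypothetical, and the default partition name __DEFAULT_PARTITION__ is an assumption.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: how replacing null/empty partition values with the
// default partition name during spec conversion loses the original value.
public class PartitionSpecSketch {
    static final String DEFAULT_PARTITION = "__DEFAULT_PARTITION__"; // assumed default name

    // Mirrors the lossy conversion: null and '' both become the default name.
    static Map<String, String> toSpec(String field, String value) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put(field, (value == null || value.isEmpty()) ? DEFAULT_PARTITION : value);
        return spec;
    }

    public static void main(String[] args) {
        // DROP PARTITION (pt = '') yields {pt=__DEFAULT_PARTITION__}, which no
        // longer matches the real value '' recorded in the manifest files.
        System.out.println(toSpec("pt", ""));   // {pt=__DEFAULT_PARTITION__}
        System.out.println(toSpec("pt", null)); // {pt=__DEFAULT_PARTITION__}
        System.out.println(toSpec("pt", "2"));  // {pt=2}
    }
}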
@zhongyujiang Are you referring to the issue of mixing empty strings and null?
This may be a tricky issue, as Hive has historically treated empty strings and null as equivalent.
@JingsongLi It's not that the values are "mixed", but that data with partition column values equal to an empty string cannot be deleted when using ALTER TABLE ... DROP PARTITION. Reproduce:

test("Paimon Partition Management: partition values are empty string") {
spark.sql(s"""
|CREATE TABLE T (pt STRING, data STRING)
|PARTITIONED BY (pt)
|""".stripMargin)
sql("INSERT INTO T VALUES('', 'a'), ('2', 'b')")
sql("ALTER TABLE T DROP PARTITION (pt = '')")
spark.sql("SELECT * FROM T").show(false)
//+---+----+
//|pt |data|
//+---+----+
//|2 |b |
//| |a |
//+---+----+
}
Oh, I wasn't aware of that. I noticed that Paimon currently places data with partition values of both null and empty string into the same partition (the default partition). Is Paimon intended to be consistent with Hive in this behavior?
My answer is, if possible, it's best for us to maintain the status quo. Can this solve your business problem? |
I think that in Paimon's manifest, null and empty string are stored as different values. In other words, in Paimon's metadata, these are currently considered two different partitions; it's just that data files of both types are stored under the same path.

Are you saying that when using HiveCatalog, these two partitions get merged into a single partition when synced to the HMS? And when doing partition deletion, would data from both null and empty string be deleted together? I'm not clear on this point.
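To illustrate the first point above, a hypothetical sketch (assumed behavior, not Paimon's actual code; class and method names are made up): the storage path collapses null and '' into the same directory, even though the manifests would keep the two real values distinct.

// Hypothetical sketch: null and '' share one directory on disk, while the
// manifests would still record the two real values as distinct partitions.
public class PartitionPathSketch {
    static final String DEFAULT_PARTITION = "__DEFAULT_PARTITION__"; // assumed default name

    static String partitionPath(String field, String value) {
        String name = (value == null || value.isEmpty()) ? DEFAULT_PARTITION : value;
        return field + "=" + name;
    }

    public static void main(String[] args) {
        System.out.println(partitionPath("pt", null)); // pt=__DEFAULT_PARTITION__
        System.out.println(partitionPath("pt", ""));   // pt=__DEFAULT_PARTITION__
    }
}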
The issue we're encountering now is that data with partCol = '' cannot be deleted using either DELETE FROM or ALTER TABLE ... DROP PARTITION; instead, it erroneously deletes data where partCol = null, which is not the expected behavior. So in this PR I fixed the pushdown of partition values to ensure that truncatePartitions works correctly.
Actually, Spark's … (lines 31 to 45 in 69b8159)
@JingsongLi Do you have time to take another look? Thanks!
I merged your first commit.
Maybe we can provide a …
Part of apache#6662 (cherry picked from commit 1a1ff56)
Purpose

Currently, when using Spark to delete Paimon partitions, deletion fails or incorrectly removes mismatched data if the partition value is NULL or an empty string. This issue occurs across ALTER TABLE ... DROP PARTITION, DELETE FROM, and TRUNCATE PARTITION.

ALTER TABLE T DROP PARTITION and DELETE FROM

Dropping a partition with partitionField = '' erroneously deletes data where partitionField = null. The root cause is that during deletion, Spark SQL passes partition values that have already been replaced by InternalRowPartitionComputer with the default partition value. However, the actual partition filter should use the original (real) partition values for matching, because the manifest files store the real partition values, not the default ones. This PR resolves the issue by preserving the original partition values (instead of replacing them with the default partition value) when computing partitions for data deletion, via a preserveNullOrEmptyValue option on InternalRowPartitionComputer, as sketched below.
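For illustration, a minimal sketch of the idea: the flag name preserveNullOrEmptyValue comes from this PR, but the class and method below are simplified, hypothetical stand-ins for InternalRowPartitionComputer, and the default partition name is an assumption.

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the fix: an opt-in flag keeps the real partition
// values so the resulting spec matches what the manifest files store.
public class PreserveValueSketch {
    static final String DEFAULT_PARTITION = "__DEFAULT_PARTITION__"; // assumed default name

    static Map<String, String> toSpec(String field, String value, boolean preserveNullOrEmptyValue) {
        Map<String, String> spec = new LinkedHashMap<>();
        if (preserveNullOrEmptyValue) {
            spec.put(field, value); // keep null/'' as-is for deletion matching
        } else {
            spec.put(field, (value == null || value.isEmpty()) ? DEFAULT_PARTITION : value);
        }
        return spec;
    }

    public static void main(String[] args) {
        System.out.println(toSpec("pt", "", true));  // {pt=}  (matches the real '' partition)
        System.out.println(toSpec("pt", "", false)); // {pt=__DEFAULT_PARTITION__} (old, lossy behavior)
    }
}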
TRUNCATE PARTITION

Throws a NullPointerException (NPE) when the partition value is null. Fixed by safely handling null values.
Tests
Added in the PR.
API and Format
No.
Documentation
No.