
Conversation

@zhongyujiang
Contributor

Purpose

Currently, when using Spark to delete Paimon partitions, deletion fails or incorrectly removes mismatched data if the partition value is NULL or an empty string. This issue occurs across ALTER TABLE ... DROP PARTITION, DELETE FROM, and TRUNCATE PARTITION.

ALTER TABLE T DROP PARTITION and DELETE FROM
Dropping a partition with partitionField = '' erroneously deletes data where partitionField = null.

The root cause is that during deletion, Spark SQL passes partition values that have already been rewritten by InternalRowPartitionComputer, which replaces null and empty values with the default partition value. However, the partition filter should match on the original (real) partition values, because the manifest files store the real values, not the default one.

This PR resolves the issue by preserving the original partition values (instead of replacing them with the default partition value) when computing partitions for data deletion: InternalRowPartitionComputer preserveNullOrEmptyValue
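The idea can be illustrated with a minimal standalone sketch. This is a simplification, not Paimon's actual code: the class, method, and flag names below are made up for illustration; only `__DEFAULT_PARTITION__` and the preserve-or-replace behavior come from the discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of partition-spec computation. In the real code this
// happens inside InternalRowPartitionComputer; names here are illustrative.
public class PartitionSpecSketch {
    static final String DEFAULT_PART = "__DEFAULT_PARTITION__";

    // With preserveNullOrEmpty = false (the old behavior), null and '' both
    // collapse to the default partition name, so the spec can no longer be
    // matched against the real values stored in manifest files.
    static Map<String, String> computeSpec(
            Map<String, String> raw, boolean preserveNullOrEmpty) {
        Map<String, String> spec = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String v = e.getValue();
            boolean nullOrEmpty = v == null || v.isEmpty();
            if (nullOrEmpty && !preserveNullOrEmpty) {
                spec.put(e.getKey(), DEFAULT_PART); // lossy: '' and null become identical
            } else {
                spec.put(e.getKey(), v); // preserved: the filter sees the real value
            }
        }
        return spec;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("pt", "");
        System.out.println(computeSpec(raw, false)); // {pt=__DEFAULT_PARTITION__}
        System.out.println(computeSpec(raw, true));  // {pt=}
    }
}
```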

TRUNCATE PARTITION
Throws a NullPointerException (NPE) when the partition value is null.

Fixed by safely handling null values.

Tests

Added in the PR.

API and Format

No.

Documentation

No.

@zhongyujiang
Contributor Author

cc @JingsongLi @Zouxxyy @YannByron Can you take a look when you have time? Thanks!

@JingsongLi
Contributor

You mean HiveCatalog? Can we fix HiveCatalog.dropPartitions to deal with default values?

@zhongyujiang
Contributor Author

@JingsongLi Thanks for looking at this.

I think this issue occurs for all catalogs. The root cause lies in Spark’s current partition-dropping logic, which incorrectly transforms partition values using InternalRowPartitionComputer—not in the catalog implementation itself. Here's how the issue unfolds:

When executing ALTER TABLE T DROP PARTITION(pt=''), PaimonPartitionManagement#toPaimonPartitions uses InternalRowPartitionComputer to convert the Spark Row into a partition spec of type Map<String, String>. During this conversion, the empty string ('') is replaced with the default partition value, resulting in (pt='DEFAULT_PARTITION').

This spec is then passed to FileStoreCommitImpl#dropPartitions to delete the partition.

Inside dropPartitions, the partition spec (pt='DEFAULT_PARTITION') is converted back into a Paimon internal row via InternalRowPartitionComputer#convertSpecToInternalRow to construct the partition predicate. However, at this stage, 'DEFAULT_PARTITION' is converted to null (see code below):

public static GenericRow convertSpecToInternalRow(
        Map<String, String> spec, RowType partType, String defaultPartValue) {
    checkArgument(
            spec.size() == partType.getFieldCount(),
            "Partition spec %s size not match partition type %s",
            spec,
            partType);
    GenericRow partRow = new GenericRow(spec.size());
    List<String> fieldNames = partType.getFieldNames();
    for (Map.Entry<String, String> entry : spec.entrySet()) {
        Object value =
                defaultPartValue != null && defaultPartValue.equals(entry.getValue())
                        ? null
                        : castFromString(
                                entry.getValue(), partType.getField(entry.getKey()).type());
        partRow.setField(fieldNames.indexOf(entry.getKey()), value);
    }
    return partRow;
}

As a result, data with pt = '' is not deleted; instead, data with pt = null gets deleted (if it exists), leading to incorrect behavior.
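The lossy round trip described above can be reduced to two string conversions. The sketch below mirrors them for a single string-typed partition column; it is an illustration of the behavior, not the actual Paimon code.

```java
public class RoundTripSketch {
    static final String DEFAULT = "__DEFAULT_PARTITION__";

    // Mirrors the first conversion (Spark row -> partition spec): a null or
    // empty user value is replaced with the default partition name.
    static String toSpecValue(String userValue) {
        return (userValue == null || userValue.isEmpty()) ? DEFAULT : userValue;
    }

    // Mirrors convertSpecToInternalRow: the default partition name is mapped
    // back to null when rebuilding the partition row for the predicate.
    static String specToRowValue(String specValue) {
        return DEFAULT.equals(specValue) ? null : specValue;
    }

    public static void main(String[] args) {
        // DROP PARTITION (pt = ''): the predicate ends up matching pt IS NULL,
        // not pt = '', so the wrong data is deleted.
        String predicateValue = specToRowValue(toSpecValue(""));
        System.out.println(predicateValue); // null
    }
}
```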

This also happens in DELETE FROM.

@JingsongLi
Contributor

@zhongyujiang Are you referring to the issue of mixing empty strings and null?

@JingsongLi
Contributor

This may be a tricky issue, as Hive's previous definition was to treat empty strings and null as equivalent.

@zhongyujiang
Contributor Author

@zhongyujiang Are you referring to the issue of mixing empty strings and null?

@JingsongLi It's not that the values are "mixed", but that data with partition column values equal to an empty string cannot be deleted when using DELETE FROM or ALTER TABLE T DROP PARTITION.

Reproduce:

  test("Paimon Partition Management: partition values are empty string") {
    spark.sql(s"""
                 |CREATE TABLE T (pt STRING, data STRING)
                 |PARTITIONED BY (pt)
                 |""".stripMargin)

    sql("INSERT INTO T VALUES('', 'a'), ('2', 'b')")

    sql("ALTER TABLE T DROP PARTITION (pt = '')")

    spark.sql("SELECT * FROM T").show(false)
    //+---+----+
    //|pt |data|
    //+---+----+
    //|2  |b   |
    //|   |a   |
    //+---+----+
  }

This may be a tricky issue, as Hive's previous definition was to treat empty strings and null as equivalent.

Oh, I wasn't aware of that. I noticed that Paimon currently places data with partition values of both null and empty string into the same partition—the default partition. Is Paimon intended to be consistent with Hive in this behavior?

@JingsongLi
Contributor

Oh, I wasn't aware of that. I noticed that Paimon currently places data with partition values of both null and empty string into the same partition—the default partition. Is Paimon intended to be consistent with Hive in this behavior?

My answer is, if possible, it's best for us to maintain the status quo. Can this solve your business problem?

@zhongyujiang
Contributor Author

if possible, it's best for us to maintain the status quo.

I think that in Paimon's manifest, partCol = null and partCol = '' are actually treated as two distinct partitions, because their _PARTITION values in the manifest are different. Querying the files table or partitions table shows two separate partitions too, for both FileSystemCatalog and HiveCatalog.

In other words, in Paimon's metadata these are currently considered two different partitions; it's just that the data files of both are stored under the same path, __DEFAULT_PARTITION__.

Are you saying that when using HiveCatalog, these two partitions get merged into a single partition when synced to the HMS? And when doing partition deletion, would data from both null and empty string be deleted together? I’m not clear on this point.

Can this solve your business problem?

The issue we’re encountering now is that data with partCol = '' cannot be deleted using either DELETE FROM or ALTER TABLE ... DROP PARTITION. Instead, it erroneously deletes data where partCol = null, which is not the expected behavior. So in this PR I fixed the push-down of partition values to ensure that partition deletion works correctly.

@zhongyujiang
Contributor Author

Actually, Spark's TRUNCATE TABLE command already works this way—it correctly passes down the partCol = '' value, so it can properly delete the partition where partCol = ''. However, when partCol = null, this command throws an NPE.
My first commit in this PR fixes this issue.

override def apply(plan: LogicalPlan): LogicalPlan = {
  plan match {
    case t @ TruncatePartition(PaimonRelation(table), ResolvedPartitionSpec(names, ident, _))
        if t.resolved =>
      assert(names.length == ident.numFields, "Names and values of partition don't match")
      val resolver = session.sessionState.conf.resolver
      val schema = table.schema
      val partitionSpec = names.zipWithIndex.map {
        case (name, index) =>
          val field = schema.find(f => resolver(f.name, name)).getOrElse {
            throw new RuntimeException(s"$name is not a valid partition column in $schema.")
          }
          (name -> ident.get(index, field.dataType).toString)
      }.toMap
      PaimonTruncateTableCommand(table, partitionSpec)
    // ...
  }
}
@zhongyujiang
Contributor Author

@JingsongLi Do you have time to take another look? Thanks!

JingsongLi pushed a commit that referenced this pull request Dec 2, 2025
@JingsongLi
Contributor

I merged your first commit.

@JingsongLi
Contributor

Maybe we can provide a void truncatePartitions(List&lt;BinaryRow&gt; partitions) for BatchTableCommit?

discivigour pushed a commit to discivigour/paimon that referenced this pull request Dec 9, 2025