
[SUPPORT] Slashes in partition columns #10754

@eshu

Description

When the partition column contains a slash character ("/"), Hudi may write the data incorrectly or fail to read it back.

Test (I use some helpers to write and read Hudi data; they write the data to the local FS and read it back):

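import java.io.{File, FilenameFilter}
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

// TestHudi is the author's own helper trait (rows, schema, createHudiDataset, readHudiDataset).
class HudiPartitionPathTest extends AnyFlatSpec with Matchers with TestHudi {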
  "Partition paths" should "be generated properly" in {
    val data = rows(
      (1, "one", "partition with space"),
      (2, "two", "partition with space"),
      (3, "three", "partition-with-dashes"),
      (4, "four", "partition-with-dashes"),
      (5, "five", "partition="),
      (6, "six", "partition="),
      (7, "seven", "partition%"),
      (8, "eight", "partition%"),
      (9, "nine", "partition"),
      (10, "ten", "partition"),
      (11, "eleven", "partition/"),
      (12, "twelve", "partition/"),
      (13, "thirteen", "partition/slaanesh"),
      (14, "fourteen", "partition/slaanesh")
    )
    val path = createHudiDataset(
      getClass.getName,
      data,
      schema("id" -> "int", "value" -> "string", "partition" -> "string")
    )
    val fsPartitionPaths = allFSPartitionPaths(new File(path), path.length + 1)
    println(fsPartitionPaths mkString "\n")
    val df = readHudiDataset(path)
    df show false
    val partitionPaths = df.select("_hoodie_partition_path").dropDuplicates.collect().map(_.getString(0)).toSet
    fsPartitionPaths shouldEqual partitionPaths
  }

  private val filter: FilenameFilter = (_, name) => !name.startsWith(".")

  def allFSPartitionPaths(dir: File, prefixLength: Int): Set[String] =
    (dir.listFiles(filter) foldLeft Set.empty[String]) { (paths, file) =>
      if (file.isFile) paths + file.getParent.substring(prefixLength)
      else paths | allFSPartitionPaths(file, prefixLength)
    }
}

The output is

daas_date=partition
daas_date=partition-with-dashes
daas_date=partition with space
daas_date=partition%
daas_date=partition/slaanesh
daas_date=partition=
+-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+------+---------------------+-------------------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path         |_hoodie_file_name                                                      |id |value |daas_date            |daas_internal_ts   |
+-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+------+---------------------+-------------------+
|20240226105035783  |20240226105035783_6_0|11                |daas_date=partition/           |961e9ba8-acef-4957-9357-7032702f1bb2-0_6-4-16_20240226105035783.parquet|11 |eleven|partition/           |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_6_1|12                |daas_date=partition/           |961e9ba8-acef-4957-9357-7032702f1bb2-0_6-4-16_20240226105035783.parquet|12 |twelve|partition/           |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_1_0|9                 |daas_date=partition            |9f1a3606-25a2-4e79-93fe-f0c03c86cf3d-0_1-4-11_20240226105035783.parquet|9  |nine  |partition            |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_1_1|10                |daas_date=partition            |9f1a3606-25a2-4e79-93fe-f0c03c86cf3d-0_1-4-11_20240226105035783.parquet|10 |ten   |partition            |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_2_0|3                 |daas_date=partition-with-dashes|49d7ac36-67ba-44df-be90-9d73b51fdbc2-0_2-4-12_20240226105035783.parquet|3  |three |partition-with-dashes|2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_2_1|4                 |daas_date=partition-with-dashes|49d7ac36-67ba-44df-be90-9d73b51fdbc2-0_2-4-12_20240226105035783.parquet|4  |four  |partition-with-dashes|2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_3_0|1                 |daas_date=partition with space |9d33b484-0e8c-4299-ab8b-235d6c3a3d5d-0_3-4-13_20240226105035783.parquet|1  |one   |partition with space |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_3_1|2                 |daas_date=partition with space |9d33b484-0e8c-4299-ab8b-235d6c3a3d5d-0_3-4-13_20240226105035783.parquet|2  |two   |partition with space |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_0_0|7                 |daas_date=partition%           |8e4bfc08-37f8-4f4e-877b-ff5b2218dbd5-0_0-4-10_20240226105035783.parquet|7  |seven |partition%           |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_0_1|8                 |daas_date=partition%           |8e4bfc08-37f8-4f4e-877b-ff5b2218dbd5-0_0-4-10_20240226105035783.parquet|8  |eight |partition%           |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_5_0|5                 |daas_date=partition=           |1134b457-b713-48e2-ace4-a99981b3efd8-0_5-4-15_20240226105035783.parquet|5  |five  |partition=           |2024-02-26 08:25:05|
|20240226105035783  |20240226105035783_5_1|6                 |daas_date=partition=           |1134b457-b713-48e2-ace4-a99981b3efd8-0_5-4-15_20240226105035783.parquet|6  |six   |partition=           |2024-02-26 08:25:05|
+-------------------+---------------------+------------------+-------------------------------+-----------------------------------------------------------------------+---+------+---------------------+-------------------+

As you can see, rows 13 and 14 were not read back, and "partition" and "partition/" share the same path on the file system (I am not sure about the impact, but there could be performance issues).
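The path collapse described above can be reproduced without Hudi. This is a minimal sketch, assuming the raw column value is appended to the base path unescaped on a Unix-like file system. The `daas_date=` prefix is taken from the output above; `PartitionPathCollapse` and its methods are hypothetical names for illustration, not Hudi API.

```scala
import java.nio.file.Paths

// Hypothetical helper, not part of Hudi: shows how unescaped partition values map to paths.
object PartitionPathCollapse {
  // Build the directory path for a raw partition value under a base directory.
  // Paths.get normalizes the result, so a trailing "/" is dropped.
  def resolved(base: String, partitionValue: String): String =
    Paths.get(base, s"daas_date=$partitionValue").toString

  // Count how many directory levels the partition value occupies.
  // A "/" inside the value introduces an extra nested level.
  def depth(partitionValue: String): Int =
    Paths.get(s"daas_date=$partitionValue").getNameCount
}
```

Under these assumptions, `resolved(base, "partition/")` and `resolved(base, "partition")` yield the same string, and `depth("partition/slaanesh")` is 2 rather than 1, which matches the nested directory seen in the file system listing.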

Maybe some characters in partition paths should be quoted (escaped)?
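One possible form of such quoting is percent-encoding the partition value before it becomes a directory name. The sketch below uses only the JDK's `URLEncoder` and `URLDecoder`; `PartitionValueEscaping` is a hypothetical name, and this is not a claim about how Hudi implements or should implement it.

```scala
import java.net.{URLDecoder, URLEncoder}

// Hypothetical helper, not part of Hudi: escapes a partition value for use as one path segment.
object PartitionValueEscaping {
  // Percent-encode characters such as "/", "=" and "%"; note that URLEncoder turns a space into "+".
  def escape(value: String): String = URLEncoder.encode(value, "UTF-8")

  // Reverse the encoding to recover the original column value.
  def unescape(segment: String): String = URLDecoder.decode(segment, "UTF-8")
}
```

With this scheme "partition" and "partition/" map to distinct segments ("partition" and "partition%2F"), and "partition/slaanesh" stays a single directory level. Hudi also exposes a write option `hoodie.datasource.write.partitionpath.urlencode`; whether enabling it resolves this case on 0.13.1 has not been verified here.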

Environment Description

  • Hudi version :
    0.13.1

  • Storage (HDFS/S3/GCS..):
    Local FS

Labels: area:writer (Write client and core write operations), priority:high (Significant impact; potential bugs), status:triaged (Issue has been reviewed and categorized)
