[SUPPORT] parquet bloom filters not supported by hudi #7117

Closed
parisni opened this issue Nov 2, 2022 · 14 comments · Fixed by #8716

Labels: feature-enquiry, performance, priority:major, spark-sql


parisni commented Nov 2, 2022

hudi 0.12.1
spark 3.2.1

Hudi has its own bloom filter implementation, used mainly for fast lookups on the Hudi record key. As of 0.11, Hudi allows storing these bloom filters in the metadata table, and also adding them on several columns. So far, those bloom filters are not used at read time.

Spark can leverage parquet bloom filters at write time. They are then used at read time to skip row groups. These bloom filters can improve queries a lot, but so far Hudi does not support them:

Basic parquet table => the bloom filter works as expected: the accumulator stays at zero when the filtered value is absent from the data, and is incremented when the value matches:

spark.sql("set parquet.filter.bloom.enabled=true")
spark.sql("set parquet.filter.columnindex.enabled=false")
spark.sql("set parquet.filter.stats.enabled=false")

import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
import org.apache.spark.sql.Row
spark.sql("select '0' as a, '1' as b, '2' c, 4 as d union select '1', '2', '3', 4")
  .write.mode("overwrite")
  .option("parquet.bloom.filter.enabled#b", "true") // parquet bloom filter on column b
  .parquet("/tmp/bloom")

// Counts parquet row groups actually read. Spark's reader (SpecificParquetRecordReaderBase)
// detects a registered accumulator of this class name and adds the row-group count itself;
// the accu.add(0) calls below merely force the scan. Pattern from Spark's ParquetFilterSuite.
class NumRowGroupsAcc extends AccumulatorV2[Integer, Integer] {
  private var _sum = 0
  override def isZero: Boolean = _sum == 0
  override def copy(): AccumulatorV2[Integer, Integer] = {
    val acc = new NumRowGroupsAcc()
    acc._sum = _sum
    acc
  }
  override def reset(): Unit = _sum = 0
  override def add(v: Integer): Unit = _sum += v
  override def merge(other: AccumulatorV2[Integer, Integer]): Unit = other match {
    case a: NumRowGroupsAcc => _sum += a._sum
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
  override def value: Integer = _sum
}
val accu = new NumRowGroupsAcc
sc.register(accu)
spark.read.format("parquet").load("/tmp/bloom").filter("b = '3'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)
spark.read.format("parquet").load("/tmp/bloom").filter("b = '2'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)

/*
NumRowGroupsAcc(id: 5611, name: None, value: 0)
NumRowGroupsAcc(id: 5611, name: None, value: 1)
*/

Hudi parquet table => the bloom filter does NOT work as expected: the accumulator is incremented both on match and on no match:

import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
import org.apache.spark.sql.Row
spark.sql("set parquet.filter.bloom.enabled=true")
spark.sql("set parquet.filter.columnindex.enabled=false")
spark.sql("set parquet.filter.stats.enabled=false")

val basePath = "/tmp/bloom_hudi/"
val hiveOptions = Map[String, String](
  "hoodie.table.name"-> "hudi_test",
"hoodie.datasource.write.recordkey.field"-> "a",
"hoodie.datasource.write.partitionpath.field"-> "d",
"hoodie.datasource.write.precombine.field"-> "c",
"hoodie.datasource.write.table.name"-> "hudi_test",
"hoodie.datasource.write.operation" -> "bulk_insert",
"parquet.bloom.filter.enabled#b" -> "true"
)
spark.sql("select '0' as a, '1' as b, '2' c, 4 as d union select '1', '2', '3', 4").write.format("hudi").
options(hiveOptions).
mode("overwrite").
save(basePath)

class NumRowGroupsAcc extends AccumulatorV2[Integer, Integer] {
  private var _sum = 0
  override def isZero: Boolean = _sum == 0
  override def copy(): AccumulatorV2[Integer, Integer] = {
    val acc = new NumRowGroupsAcc()
    acc._sum = _sum
    acc
  }
  override def reset(): Unit = _sum = 0
  override def add(v: Integer): Unit = _sum += v
  override def merge(other: AccumulatorV2[Integer, Integer]): Unit = other match {
    case a: NumRowGroupsAcc => _sum += a._sum
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
  override def value: Integer = _sum
}
val accu = new NumRowGroupsAcc
sc.register(accu)
spark.read.format("hudi").load(basePath).filter("b = '3'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)
spark.read.format("hudi").load(basePath).filter("b = '2'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)
/*
accu: NumRowGroupsAcc = NumRowGroupsAcc(id: 635, name: None, value: 1)
accu: NumRowGroupsAcc = NumRowGroupsAcc(id: 635, name: None, value: 2)
*/
nsivabalan added labels feature-enquiry, priority:minor on Nov 8, 2022

parisni commented Feb 14, 2023

Hi @nsivabalan, bloom filters at read time are a useful feature for read performance. Thoughts?


parisni commented May 14, 2023

Hudi is able to benefit from parquet files written with bloom filters (tested by replacing the Hudi parquet files with vanilla Spark ones; the Hudi datasource then triggers the bloom filters).

Digging into the source code, I guess the reason bloom filters are not taken into consideration lies in Hudi's ParquetWriter wrapper: it calls the public ParquetWriter constructor, which has very limited parquet feature support. There is a more complete constructor, but sadly its access is limited to the package.

Accessing the package-private constructor could be done by changing the HoodieBaseParquetWriter package to org.apache.parquet.hadoop, but ParquetWriter would also have to be present in the same jar (a common package cannot be spread over multiple jars).

A better option would be for parquet to provide more suitable constructors.

Alternatively, we could make use of the builder; there are some pointers here. A minimal sketch follows below.
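
For illustration, here is a minimal sketch of the parquet-mr builder API (parquet >= 1.12 assumed) that exposes the bloom filter knobs the public constructor cannot reach. It uses AvroParquetWriter purely as an example; this is not Hudi code:

import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Illustrative two-column schema mirroring the repro (columns a and b).
val schema = SchemaBuilder.record("rec").fields()
  .requiredString("a").requiredString("b").endRecord()

// The builder exposes per-column bloom filter settings.
val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/bloom_builder"))
  .withSchema(schema)
  .withBloomFilterEnabled("b", true) // enable a bloom filter on column b
  .withBloomFilterNDV("b", 1000L)    // expected distinct values, sizes the filter
  .build()

val rec = new GenericData.Record(schema)
rec.put("a", "0"); rec.put("b", "1")
writer.write(rec)
writer.close()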


parisni commented May 15, 2023

@danny0405 thoughts?

@danny0405 (Contributor)

> which has very limited parquet feature support. There is a more complete constructor but sadly its access is limited to the package.

Two questions:

  1. What exactly is the feature you want from the parquet writers? Does the BloomFilter written by Hudi's parquet writer wrapper not meet your need?
  2. How does Spark write its BloomFilters? How did they solve the package/accessibility issue you mentioned?

danny0405 self-assigned this on May 15, 2023
danny0405 added labels priority:major, performance, spark-sql and removed priority:minor on May 15, 2023

parisni commented May 15, 2023

As for 1, please see the OP and let me know if it is still unclear.

As for 2, Spark leverages the parquet library to write bloom filters. It just passes the properties from the user through the datasource API.
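
For illustration, the pass-through from the repro above looks like this; df stands for any DataFrame, and the expected-ndv key is parquet's standard sizing hint (parquet >= 1.12 assumed):

// Options prefixed with "parquet." are handed to the parquet writer as-is;
// the "#b" suffix scopes the setting to column b.
df.write
  .option("parquet.bloom.filter.enabled#b", "true")
  .option("parquet.bloom.filter.expected.ndv#b", "1000") // optional sizing hint
  .parquet("/tmp/bloom")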

@danny0405 (Contributor)

Yeah, please elaborate more on 1. And for 2, can Hudi take a similar approach for its writers?


parisni commented May 15, 2023

Hudi should support vanilla parquet bloom filters, because they are a standard optimization supported by every query engine using parquet 1.12 and above. Moreover, Hudi does not provide an equivalent: Hudi's own bloom filters are not used for select queries, only for update operations. Supporting vanilla parquet bloom filters would add another optimization to Hudi's existing set (z-ordering, parquet stats) almost for free.

As for 2, providing bloom filters in Hudi is just a matter of passing the configuration through to parquet. But after analysis, it likely needs a small refactoring of the way we configure parquet writers: stop using the deprecated ParquetWriter constructor and use the builder, which supports all parquet configurations.
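
As a hedged sketch of that refactoring (the helper name applyBloomOptions is hypothetical, not an existing Hudi API): read the column-scoped parquet.bloom.filter.enabled#<col> keys from the writer's Hadoop configuration and translate them into builder calls, assuming parquet >= 1.12:

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetWriter

// Copy column-scoped bloom filter settings from a Hadoop Configuration onto
// any ParquetWriter builder (hypothetical helper, illustration only).
def applyBloomOptions[T, B <: ParquetWriter.Builder[T, B]](builder: B, conf: Configuration): B = {
  val prefix = "parquet.bloom.filter.enabled#"
  conf.iterator().asScala.foldLeft(builder) { (b, entry) =>
    if (entry.getKey.startsWith(prefix) && entry.getValue.toBoolean)
      b.withBloomFilterEnabled(entry.getKey.stripPrefix(prefix), true)
    else b
  }
}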


parisni commented May 15, 2023

More details:

@danny0405 (Contributor)

> stop using the deprecated ParquetWriter constructor and use the builder, which supports all parquet configurations

I'm okay with that.


jonvex commented Jan 29, 2024

How does

accu.add(0)

increment the counter? The implementation of add is

override def add(v: Integer): Unit = _sum += v


parisni commented Jan 31, 2024

That's a good point. I don't know; I found that code in the Spark tests. The point is, it does increment!


jonvex commented Jan 31, 2024

#10278: I am working on the FileGroup Reader for Hudi 1.0, and that test was failing, but if I change it to accu.add(1) then it works. That's why I'm asking; I don't want to break parquet bloom filters.


parisni commented Feb 1, 2024

Okay, I will double check. Thanks for reaching out.


jonvex commented Apr 29, 2024

@parisni, an update on this: I am working on getting the vectorized reader enabled in more scenarios with the new filegroup reader, and now that test works again.
