[SUPPORT] parquet bloom filters not supported by hudi #7117

Closed
parisni opened this issue Nov 2, 2022 · 14 comments · Fixed by #8716

Labels: feature-enquiry, performance, priority:major, spark-sql


parisni commented Nov 2, 2022

hudi 0.12.1
spark 3.2.1

Hudi has its own bloom filter implementation, used mainly for fast lookups on the Hudi record key. As of 0.11, Hudi allows storing these bloom filters in the metadata table, and also adding them on several columns. So far, those bloom filters are not used at read time.

Spark can leverage parquet bloom filters at write time. They are then used at read time to skip row groups. These bloom filters can improve queries a lot, but so far Hudi does not support them:

Basic parquet table => the bloom filter works as expected: the accumulator stays at zero when the filtered value is absent from the data, and is incremented when the value matches:

spark.sql("set parquet.filter.bloom.enabled=true")
spark.sql("set parquet.filter.columnindex.enabled=false")
spark.sql("set parquet.filter.stats.enabled=false")

import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
import org.apache.spark.sql.Row
spark.sql("select '0' as a, '1' as b, '2' c, 4 as d union select '1', '2', '3', 4")
  .write.mode("overwrite")
  .option("parquet.bloom.filter.enabled#b", "true") // parquet bloom filter on column b
  .parquet("/tmp/bloom")

// Counts parquet row groups actually read. Spark's reader (SpecificParquetRecordReaderBase)
// detects a registered accumulator of this class name and adds the row-group count itself;
// the accu.add(0) calls below merely force the scan. Pattern from Spark's ParquetFilterSuite.
class NumRowGroupsAcc extends AccumulatorV2[Integer, Integer] {
  private var _sum = 0
  override def isZero: Boolean = _sum == 0
  override def copy(): AccumulatorV2[Integer, Integer] = {
    val acc = new NumRowGroupsAcc()
    acc._sum = _sum
    acc
  }
  override def reset(): Unit = _sum = 0
  override def add(v: Integer): Unit = _sum += v
  override def merge(other: AccumulatorV2[Integer, Integer]): Unit = other match {
    case a: NumRowGroupsAcc => _sum += a._sum
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
  override def value: Integer = _sum
}
val accu = new NumRowGroupsAcc
sc.register(accu)
spark.read.format("parquet").load("/tmp/bloom").filter("b = '3'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)
spark.read.format("parquet").load("/tmp/bloom").filter("b = '2'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)

/*
NumRowGroupsAcc(id: 5611, name: None, value: 0)
NumRowGroupsAcc(id: 5611, name: None, value: 1)
*/

Hudi parquet table => the bloom filter does NOT work as expected: the accumulator is incremented both on match and on no match:

import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, Utils}
import org.apache.spark.sql.Row
spark.sql("set parquet.filter.bloom.enabled=true")
spark.sql("set parquet.filter.columnindex.enabled=false")
spark.sql("set parquet.filter.stats.enabled=false")

val basePath = "/tmp/bloom_hudi/"
val hiveOptions = Map[String, String](
  "hoodie.table.name"-> "hudi_test",
"hoodie.datasource.write.recordkey.field"-> "a",
"hoodie.datasource.write.partitionpath.field"-> "d",
"hoodie.datasource.write.precombine.field"-> "c",
"hoodie.datasource.write.table.name"-> "hudi_test",
"hoodie.datasource.write.operation" -> "bulk_insert",
"parquet.bloom.filter.enabled#b" -> "true"
)
spark.sql("select '0' as a, '1' as b, '2' c, 4 as d union select '1', '2', '3', 4").write.format("hudi").
options(hiveOptions).
mode("overwrite").
save(basePath)

class NumRowGroupsAcc extends AccumulatorV2[Integer, Integer] {
  private var _sum = 0
  override def isZero: Boolean = _sum == 0
  override def copy(): AccumulatorV2[Integer, Integer] = {
    val acc = new NumRowGroupsAcc()
    acc._sum = _sum
    acc
  }
  override def reset(): Unit = _sum = 0
  override def add(v: Integer): Unit = _sum += v
  override def merge(other: AccumulatorV2[Integer, Integer]): Unit = other match {
    case a: NumRowGroupsAcc => _sum += a._sum
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
  override def value: Integer = _sum
}
val accu = new NumRowGroupsAcc
sc.register(accu)
spark.read.format("hudi").load(basePath).filter("b = '3'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)
spark.read.format("hudi").load(basePath).filter("b = '2'").foreachPartition((it: Iterator[Row]) => it.foreach(_ => accu.add(0)))
println(accu)
/*
accu: NumRowGroupsAcc = NumRowGroupsAcc(id: 635, name: None, value: 1)
accu: NumRowGroupsAcc = NumRowGroupsAcc(id: 635, name: None, value: 2)
*/
nsivabalan added labels feature-enquiry, priority:minor on Nov 8, 2022

parisni commented Feb 14, 2023

Hi @nsivabalan, bloom filters at read time are a useful feature for read performance. Thoughts?


parisni commented May 14, 2023

Hudi is able to benefit from parquet files written with bloom filters (tested by replacing the Hudi parquet files with vanilla Spark ones; the Hudi datasource then triggers the bloom filters).

Digging into the source code, I guess the reason bloom filters are not taken into consideration lies in Hudi's ParquetWriter wrapper: it calls the public ParquetWriter constructor, which has very limited parquet feature support. There is a more complete constructor, but sadly its access is limited to the package.

Accessing the package-private constructor could be done by changing the HoodieBaseParquetWriter package to org.apache.parquet.hadoop, but ParquetWriter would also have to be present in the same jar (a common package cannot be spread over multiple jars).

A better option would be for parquet to provide more suitable constructors.

Alternatively, we could make use of the builder; there are some pointers here. A minimal sketch follows below.
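
For illustration, here is a minimal sketch of the parquet-mr builder API (parquet >= 1.12 assumed) that exposes the bloom filter knobs the public constructor cannot reach. It uses AvroParquetWriter purely as an example; this is not Hudi code:

import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Illustrative two-column schema mirroring the repro (columns a and b).
val schema = SchemaBuilder.record("rec").fields()
  .requiredString("a").requiredString("b").endRecord()

// The builder exposes per-column bloom filter settings.
val writer = AvroParquetWriter.builder[GenericRecord](new Path("/tmp/bloom_builder"))
  .withSchema(schema)
  .withBloomFilterEnabled("b", true) // enable a bloom filter on column b
  .withBloomFilterNDV("b", 1000L)    // expected distinct values, sizes the filter
  .build()

val rec = new GenericData.Record(schema)
rec.put("a", "0"); rec.put("b", "1")
writer.write(rec)
writer.close()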


parisni commented May 15, 2023

@danny0405 thoughts?

@danny0405 (Contributor)

> which has very limited parquet feature support. There is a more complete constructor but sadly its access is limited to the package.

Two questions:

  1. What exactly is the feature you want from the parquet writers? Does the BloomFilter written by Hudi's parquet writer wrapper not meet your need?
  2. How does Spark write its BloomFilters? How did they solve the package/accessibility issue you mentioned?

danny0405 self-assigned this on May 15, 2023
danny0405 added labels priority:major, performance, spark-sql and removed priority:minor on May 15, 2023

parisni commented May 15, 2023

As for 1, please see the OP and let me know if it is still unclear.

As for 2, Spark leverages the parquet library to write bloom filters. It just passes the properties from the user through the datasource API.
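
For illustration, the pass-through from the repro above looks like this; df stands for any DataFrame, and the expected-ndv key is parquet's standard sizing hint (parquet >= 1.12 assumed):

// Options prefixed with "parquet." are handed to the parquet writer as-is;
// the "#b" suffix scopes the setting to column b.
df.write
  .option("parquet.bloom.filter.enabled#b", "true")
  .option("parquet.bloom.filter.expected.ndv#b", "1000") // optional sizing hint
  .parquet("/tmp/bloom")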

@danny0405 (Contributor)

Yeah, please elaborate more on 1. And for 2, can Hudi take a similar approach for its writers?


parisni commented May 15, 2023

Hudi should support vanilla parquet bloom filters, because they are a standard optimization supported by every query engine using parquet 1.12 and above. Moreover, Hudi does not provide an equivalent: Hudi's own bloom filters are not used for select queries, only for update operations. Supporting vanilla parquet bloom filters would add another optimization to Hudi's existing set (z-ordering, parquet stats) almost for free.

As for 2, providing bloom filters in Hudi is just a matter of passing the configuration through to parquet. But after analysis, it likely needs a small refactoring of the way we configure parquet writers: stop using the deprecated ParquetWriter constructor and use the builder, which supports all parquet configurations.
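
As a hedged sketch of that refactoring (the helper name applyBloomOptions is hypothetical, not an existing Hudi API): read the column-scoped parquet.bloom.filter.enabled#<col> keys from the writer's Hadoop configuration and translate them into builder calls, assuming parquet >= 1.12:

import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetWriter

// Copy column-scoped bloom filter settings from a Hadoop Configuration onto
// any ParquetWriter builder (hypothetical helper, illustration only).
def applyBloomOptions[T, B <: ParquetWriter.Builder[T, B]](builder: B, conf: Configuration): B = {
  val prefix = "parquet.bloom.filter.enabled#"
  conf.iterator().asScala.foldLeft(builder) { (b, entry) =>
    if (entry.getKey.startsWith(prefix) && entry.getValue.toBoolean)
      b.withBloomFilterEnabled(entry.getKey.stripPrefix(prefix), true)
    else b
  }
}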


parisni commented May 15, 2023

More details:

@danny0405 (Contributor)

> stop using the deprecated ParquetWriter constructor and use the builder, which supports all parquet configurations

I'm okay with that.


jonvex commented Jan 29, 2024

How does

accu.add(0)

increment the counter? The implementation of add is

override def add(v: Integer): Unit = _sum += v


parisni commented Jan 31, 2024

That's a good point. I don't know; I found that code in the Spark tests. The point is, it does increment!


jonvex commented Jan 31, 2024

#10278: I am working on the FileGroup Reader for Hudi 1.0, and that test was failing, but if I change it to accu.add(1) then it works. That's why I'm asking; I don't want to break parquet bloom filters.


parisni commented Feb 1, 2024

Okay, I will double check. Thanks for reaching out.


jonvex commented Apr 29, 2024

@parisni, an update on this: I am working on getting the vectorized reader enabled in more scenarios with the new filegroup reader, and now that test works again.
