Refactor the configuration for bloom filters #2464
Comments
Yuming Wang / @wangyum:
// Benchmark: write a 15M-row DataFrame with the Parquet bloom filter
// globally disabled vs. globally enabled for all columns.
val numRows = 1024 * 1024 * 15
val df = spark.range(numRows).selectExpr(
  "id",
  "cast(id as string) as s",
  "cast(id as timestamp) as ts",
  "cast(cast(id as timestamp) as date) as td",
  "cast(id as decimal) as dec")
val benchmark = new org.apache.spark.benchmark.Benchmark(
  "Benchmark bloom filter write",
  numRows,
  minNumIters = 5)
Seq(false, true).foreach { bloomFilterEnabled =>
  val name = s"Write parquet ${if (bloomFilterEnabled) "(bloom filter)" else ""}"
  benchmark.addCase(name) { _ =>
    withSQLConf(org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> s"$bloomFilterEnabled") {
      df.write.mode("overwrite").parquet("/tmp/spark/parquet")
    }
  }
}
benchmark.run()
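Note that withSQLConf is Spark's test-side helper (from SQLHelper), so the snippet above runs inside a Spark test suite. In a plain spark-shell session, a hand-rolled stand-in along these lines should behave the same; a sketch, not the Spark API itself:

// Set the given SQL confs, run the body, then restore the previous values.
def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
  // Remember current values so they can be restored afterwards.
  val previous = pairs.map { case (key, _) => key -> spark.conf.getOption(key) }
  pairs.foreach { case (key, value) => spark.conf.set(key, value) }
  try f finally previous.foreach {
    case (key, Some(old)) => spark.conf.set(key, old)
    case (key, None)      => spark.conf.unset(key)
  }
}

Spark copies SQL conf entries into the Hadoop configuration used by the Parquet writer, which is why setting parquet.bloom.filter.enabled this way takes effect.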
Gabor Szadovszky / @gszadovszky: I am not an expert on this feature, and while we may be able to improve the write performance, generating bloom filters will always have some performance impact. It is up to the user to decide whether this impact is worth the potential benefit at read time. That is why it is highly recommended to specify exactly which columns bloom filters are required for, and to specify the other bloom filter parameters as well. @chenjunjiedada, any comments on this?
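For a writer configured directly through a Hadoop Configuration, the per-column style Gabor suggests would look roughly like this; a sketch, where the BLOOM_FILTER_EXPECTED_NDV sizing key, the column name ts, and the NDV value are illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetOutputFormat

val conf = new Configuration()
// Keep the global switch off so only explicitly listed columns pay the cost.
conf.set(ParquetOutputFormat.BLOOM_FILTER_ENABLED, "false")
// Enable only for the "ts" column via the "#<column>" key suffix.
conf.set(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts", "true")
// Size the filter: expected number of distinct values for that column.
conf.set(ParquetOutputFormat.BLOOM_FILTER_EXPECTED_NDV + "#ts", "1000000")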
Junjie Chen / @chenjunjiedada:
Gabor Szadovszky / @gszadovszky:
Yuming Wang / @wangyum:
set parquet.bloom.filter.enabled=false;
set parquet.bloom.filter.enabled#ts=true;
set parquet.bloom.filter.enabled#dec=true;

Benchmark and benchmark result:

val numRows = 1024 * 1024 * 15
val df = spark.range(numRows).selectExpr(
  "id",
  "cast(id as string) as s",
  "cast(id as timestamp) as ts",
  "cast(cast(id as timestamp) as date) as td",
  "cast(id as decimal) as dec")
val benchmark = new org.apache.spark.benchmark.Benchmark(
  "Benchmark bloom filter write",
  numRows,
  minNumIters = 5)
// Baseline: no bloom filters at all.
benchmark.addCase("default") { _ =>
  withSQLConf() {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}
// Bloom filter for a single column only.
benchmark.addCase("Build bloom filter for ts column") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "false",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}
// Bloom filters for two columns.
benchmark.addCase("Build bloom filter for ts and dec columns") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "false",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts" -> "true",
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#dec" -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}
// Bloom filters for every column.
benchmark.addCase("Build bloom filter for all columns") { _ =>
  withSQLConf(
    org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED -> "true") {
    df.write.mode("overwrite").parquet("/tmp/spark/parquet")
  }
}
benchmark.run()
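Since Spark also copies Parquet data source options into the Hadoop configuration used for a write, the same per-column keys should be usable per write instead of session-wide; a sketch under that assumption, with the path and column names taken from the benchmark above:

// Per-write scope: does not change the session configuration.
df.write
  .mode("overwrite")
  .option(org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED, "false")
  .option(org.apache.parquet.hadoop.ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#ts", "true")
  .parquet("/tmp/spark/parquet")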
Gabor Szadovszky / @gszadovszky:
Yuming Wang / @wangyum:
Refactor the Hadoop configuration for bloom filters according to PARQUET-1784.
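For reference, on the parquet-hadoop side the column-wise settings also surface on the ParquetWriter builder; a minimal sketch, assuming the withBloomFilterEnabled/withBloomFilterNDV builder methods and an illustrative one-column schema and path:

import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.simple.SimpleGroupFactory
import org.apache.parquet.hadoop.example.ExampleParquetWriter
import org.apache.parquet.schema.MessageTypeParser

// Illustrative schema; "ts" is the only column given a bloom filter.
val schema = MessageTypeParser.parseMessageType(
  "message example { required int64 ts; }")
val writer = ExampleParquetWriter.builder(new Path("/tmp/example.parquet"))
  .withType(schema)
  .withBloomFilterEnabled("ts", true) // per-column enable
  .withBloomFilterNDV("ts", 1000000L) // expected distinct values, for sizing
  .build()
writer.write(new SimpleGroupFactory(schema).newGroup().append("ts", 42L))
writer.close()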
Reporter: Gabor Szadovszky / @gszadovszky
Assignee: Gabor Szadovszky / @gszadovszky
Related issues:
PRs and other links:
Note: This issue was originally created as PARQUET-1805. Please see the migration documentation for further details.