[SPARK-10841][SQL] Add pushdown support of UDF for Parquet #8922
Conversation
Test build #43048 has finished for PR 8922 at commit
Test build #43049 has finished for PR 8922 at commit
retest this please.
Test build #43050 has finished for PR 8922 at commit
Test build #43061 has finished for PR 8922 at commit
Test build #43146 has finished for PR 8922 at commit
Test build #43147 has finished for PR 8922 at commit
ping @liancheng @marmbrus
I'm a little skeptical that this is worth the complexity. Do you have real workloads that this speeds up significantly?
I will post a performance comparison later.
Test build #43175 has finished for PR 8922 at commit
I've used a daily SQL query to test the performance difference. Roughly, based on Spark 1.4.1 backported with this patch, it shows about a 20% relative improvement. The actual improvement may vary depending on the number of columns filtered by and how many records actually pass the filter. More importantly, I think this patch should be able to filter out unnecessary data in advance, and thus reduce memory usage.
Test build #43214 has finished for PR 8922 at commit
"daily sql query" is not sufficiently descriptive. Please post actual benchmark results with code when making pull requests that claim to improve performance. It would also be good to evaluate the cost in degenerate cases. For example, I think you are adding an object allocation per input tuple when boxing for any queries that filter by UDF in parquet. Are you slowing down cases where the filter is not selective? If we want to improve the set of things that we push down, I don't think specializing for just UDFs in comparison operations is worth it given how much you are widening the API. Could we just have a single Function filter: case class FilterFunction(attribute: String, function: Any => Boolean)or maybe some specialized variants: case class IntegerFilter(attribute: String, function: Int => Boolean)
... |
Hmm, because the SQL query and data schema are sensitive company business information, I may not be able to post them publicly here. The data size is hundreds of GB to 1 TB, and the query roughly selects a dozen columns from the table, with a few filters involving UDFs, a lateral view, and a group by. I just realized that we don't need to do that; your suggestion is correct. But in our cases, these existing UDFs do not always have signatures such as `Int => Boolean`. That is the reason why I designed the API the way it is in this patch. If you still think this API is too general, I can update it as you suggest.
Test build #43263 has finished for PR 8922 at commit
I'm not suggesting we specialize for UDFs that are `Int => Boolean`. I'm suggesting that we generate a function for any predicate that takes only a single attribute as input, and then pass that generated function as a filter to the data sources. That way this will work for more than just simple comparisons with UDFs. We can do this for any predicates that are not matched by the other filter patterns.
Also, I understand you can't share your internal data/code, but you can create a similar benchmark on synthetic data.
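A minimal sketch of such a synthetic benchmark (illustrative, not from the PR; it assumes a Spark 1.x `sqlContext`, and `decode_id` is a made-up stand-in for the real UDF):

```scala
// Synthetic benchmark sketch: time the same UDF filter with Parquet
// filter pushdown enabled and disabled. On stock Spark the UDF filter
// is not pushed down in either configuration, so the comparison only
// becomes meaningful with a patch like this one applied.
val path = "/tmp/pushdown-benchmark"

// Generate a synthetic single-column Parquet dataset.
sqlContext.range(0L, 10000000L).toDF("customer_id")
  .write.mode("overwrite").parquet(path)

// Made-up UDF standing in for the real (private) parsing logic.
sqlContext.udf.register("decode_id", (id: Long) => (id % 97).toString)

def time(label: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
}

for (pushdown <- Seq("true", "false")) {
  sqlContext.setConf("spark.sql.parquet.filterPushdown", pushdown)
  time(s"filterPushdown=$pushdown") {
    sqlContext.read.parquet(path)
      .filter("decode_id(customer_id) = '0'")
      .count()
  }
}
```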
@marmbrus Thanks for clarifying that. I have quickly scanned the predicates in expressions. Actually, it seems that I can't find any predicate that takes only a single attribute as input and is not already matched to be pushed down to Parquet. It looks like UDFs are the only target we need to address?
The point I'm trying to make is that we should generalize this, so that we don't have to special-case every possible expression shape. Your implementation, for example, doesn't handle predicates beyond those simple comparisons. Here is a very rough sketch:

```scala
case class FilterFunction(func: Any => Boolean) extends Filter

protected[sql] def selectFilters(filters: Seq[Expression]) = {
  filters.flatMap {
    // ... existing filter patterns ...
    case e: Expression if e.references.size == 1 =>
      val boundExpression = BindReferences.bindReference(e, e.references.toSeq)
      Some(FilterFunction(
        (a: Any) => {
          val inputRow = new GenericInternalRow(Array(a))
          boundExpression.eval(inputRow).asInstanceOf[Boolean]
        }))
  }
}
```

There are a bunch of things that would need to be done before we could commit this, though.
This implementation only considers the use case of evaluating a single attribute with a UDF and comparing the result with a literal value; we only consider this given the current implementation. Your proposal looks good and very general. However, I am a little worried about the performance regression brought by creating a row for each input value and evaluating the expression on that row. It should be slower than the original built-in Parquet filters we are using currently. This patch helps us reduce the memory footprint required for loading a lot of data from Parquet files. As for performance, the improvement is not significant, but it at least competes with the case of not pushing down. I agree that this patch introduces additional complexity to the API. If you still think it is not worth it, I will close this PR first. Thanks for reviewing and the suggestions.
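If the sketch above were adopted, one possible way to soften the per-value allocation cost might be to reuse a single mutable row per predicate instance instead of allocating one per value. A rough sketch (the class is hypothetical; it assumes a mutable row type with `update`, such as `GenericMutableRow` in the Spark 1.x codebase):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, GenericMutableRow}

// Hypothetical helper: evaluates a bound single-attribute predicate,
// reusing one mutable row rather than allocating a row per input value.
// Note: a predicate pushed into a data source generally needs to be
// Serializable so it can be shipped to executors.
class ReusedRowPredicate(boundExpression: Expression)
    extends (Any => Boolean) with Serializable {

  // One row with a single slot for the attribute value (ordinal 0).
  private[this] val inputRow = new GenericMutableRow(1)

  override def apply(value: Any): Boolean = {
    inputRow.update(0, value)
    boundExpression.eval(inputRow).asInstanceOf[Boolean]
  }
}
```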
JIRA: https://issues.apache.org/jira/browse/SPARK-10841
Currently we can't push down filters involving UDFs to Parquet. In practice, we have some usage of UDFs in filters.
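A hypothetical query of the shape described (the table `t` and the registered function are made up for illustration):

```scala
// Illustrative only: a filter that applies a UDF to a column and
// compares the result against a literal string value.
sqlContext.udf.register("udf", (raw: Long) => java.lang.Long.toHexString(raw))

val result = sqlContext.sql(
  "SELECT * FROM t WHERE udf(customer_id) = '1a2b3c'")
```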
In the above query, `customer_id` is a column storing the customer id in some encoded form, and `udf` is a function used to parse this column into a string value. Without pushing the filter down to Parquet, we will fetch all data from many Parquet files and then perform the filtering in Spark.

Using Parquet's `UserDefinedPredicate`, we can push these filters down to Parquet.
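For reference, a rough sketch of what this could look like on the Parquet side (`UserDefinedPredicate`, `Statistics`, `FilterApi`, and `Binary` are real parquet-mr classes; the wrapper class is hypothetical). Since the wrapped function is opaque, min/max statistics cannot be used to skip row groups, so `canDrop` and `inverseCanDrop` must conservatively return false:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, Statistics, UserDefinedPredicate}
import org.apache.parquet.io.api.Binary

// Hypothetical wrapper pushing an arbitrary Binary => Boolean function
// into Parquet's record-level filtering.
class FunctionPredicate(func: Binary => Boolean)
    extends UserDefinedPredicate[Binary] with Serializable {

  // Record-level check: keep only values accepted by the function.
  override def keep(value: Binary): Boolean = func(value)

  // An opaque function tells us nothing about min/max ranges, so we can
  // never prove that a whole row group (or its inverse) can be dropped.
  override def canDrop(statistics: Statistics[Binary]): Boolean = false
  override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
}

// Attaching it to a column (the instance-based FilterApi.userDefined
// overload requires the predicate to be Serializable):
// val pred = FilterApi.userDefined(
//   FilterApi.binaryColumn("customer_id"), new FunctionPredicate(f))
```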