[SPARK-33187][SQL] Add a check on the number of returned partitions in the HiveShim#getPartitionsByFilter method #30225
Conversation
Can one of the admins verify this patch?
Could you add tests? BTW, does Hive have a similar config?
Is it a good design if the partition number is larger than 100,000?
User queries generally do not exceed 100,000 partitions; it's just that the filter sometimes cannot prune partitions very well.
[Resolved review threads (outdated) on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala and sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala]
Could you verify whether hive.metastore.limit.partition.request works? If it works, we can add it to the documentation. We have added a Hadoop config before: mapreduce.fileoutputcommitter.algorithm.version.
Outdated review thread on sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
.version("3.1.0")
.intConf
.checkValue(_ >= -1, "The maximum must be a positive integer, -1 to follow the Hive config.")
.createWithDefault(100000)
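For context, a minimal sketch of what a full entry like this might look like inside the SQLConf object follows; the key name spark.sql.hive.metastorePartitionLimit and the default come from the PR description, while the doc string is illustrative and may not match the actual patch:

```scala
// Sketch only: the key name and default come from the PR description; the doc
// text is a guess at intent, not the wording in the patch.
val HIVE_METASTORE_PARTITION_LIMIT =
  buildConf("spark.sql.hive.metastorePartitionLimit")
    .doc("Maximum number of partitions that a single query may fetch from the Hive " +
      "metastore. -1 means no check on the Spark side (follow the Hive config).")
    .version("3.1.0")
    .intConf
    .checkValue(_ >= -1, "The maximum must be a positive integer, -1 to follow the Hive config.")
    .createWithDefault(100000)
```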
Have you considered keeping the default behavior as it is but allowing it to be configurable? Changing it means we'll now need to make two HMS calls (one additional getNumPartitionsByFilter), which I'm not sure is desirable (I have seen HMS perform very badly in production before).
@sunchao Thank you for your response. I think this is a reasonable maximum; Presto also has a parameter to limit the number of partitions in HiveMetadata#getPartitionsAsList, with a default value of 100_000.
Yeah, the default value 100_000 looks fine to me. My main question is whether we need to make that the default and double the HMS calls. It seems Presto doesn't call getNumPartitionsByFilter; it streams through a partition iterator and stops once the threshold is reached.
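For illustration, the streaming approach described above could be sketched roughly like this in Scala; the names fetchWithLimit, partitionIterator, and limit are hypothetical and not part of Presto or of this patch:

```scala
// Illustrative sketch of the "stream and stop at the threshold" idea: no
// separate count call, just fail as soon as more than `limit` partitions
// have been read from the iterator. Names are hypothetical.
def fetchWithLimit[T](partitionIterator: Iterator[T], limit: Int): Seq[T] = {
  val buffer = scala.collection.mutable.ArrayBuffer.empty[T]
  while (partitionIterator.hasNext) {
    buffer += partitionIterator.next()
    if (limit >= 0 && buffer.size > limit) {
      throw new RuntimeException(
        s"Query would fetch more than $limit partitions from the Hive metastore")
    }
  }
  buffer.toSeq
}
```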
@wangyum It does not work on the client.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Add a check on the number of returned metastore partitions by calling Hive#getNumPartitionsByFilter, and add the SQL config spark.sql.hive.metastorePartitionLimit with a default value of 100_000.
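A rough sketch of the kind of check being proposed follows; the config name and the Hive#getNumPartitionsByFilter call come from the description above, while the method shape, exact signature, and error message here are illustrative only and not taken from the patch:

```scala
import org.apache.hadoop.hive.ql.metadata.{Hive, Table}

// Sketch, not the actual patch: ask the metastore for the partition count first
// and fail fast if it exceeds the configured limit (-1 disables the check).
def checkPartitionLimit(hive: Hive, table: Table, filter: String, limit: Int): Unit = {
  if (limit >= 0) {
    // This is the extra HMS round trip discussed in the review comments above.
    val numPartitions = hive.getNumPartitionsByFilter(table, filter)
    if (numPartitions > limit) {
      throw new IllegalArgumentException(
        s"Fetching $numPartitions partitions of table ${table.getTableName} exceeds the " +
          s"limit of $limit set by spark.sql.hive.metastorePartitionLimit")
    }
  }
}
```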
Why are the changes needed?
In the method Shim#getPartitionsByFilter, when the filter is empty or when the Hive table has a large number of partitions, calling getAllPartitionsMethod or getPartitionsByFilterMethod can result in a driver OOM.
Does this PR introduce any user-facing change?
No
How was this patch tested?
This change is already covered by existing tests.