HIVE-24343: Table partition operations (create, drop, select) fail wh…#1640

Closed

vnhive wants to merge 2 commits intoapache:masterfrom

vnhive:HIVE-24343

Contributor

vnhive commented Nov 2, 2020

…en the number of partitions is greater than 32767 (signed int)

What changes were proposed in this pull request?

The implementation of getPartitionIdsViaSqlFilter should be extended to account for the limit of 32767
on the number of parameters to the underlying JDBC driver.

Why are the changes needed?

The table partition operations - create, drop, select access the underlying relation database using JDO,
which internally routes the operations through the JDBC driver. Most of the underlying JDBC driver
implementations place a limit on the number of parameters that can be passed through a statement
implementation. These limits should be taken into account when querying the underlying metastore.

Does this PR introduce any user-facing change?

The patch will ensure that if more than 32767 operations are being operated on (added, dropped or queried)
at the same time, it will not fail.

How was this patch tested?

The patch was verified by attaching the debugger and verifying the code flow
The existing metastore tests passed.

kgyrtkirk added tests pending tests unstable and removed tests pending labels

vnhive force-pushed the HIVE-24343 branch from f3c5e03 to aa66d08 Compare

November 3, 2020 08:32

kgyrtkirk added tests pending and removed tests unstable labels

vnhive force-pushed the HIVE-24343 branch from aa66d08 to ddfcb33 Compare

November 3, 2020 08:36

kgyrtkirk added tests failed tests pending tests passed and removed tests pending tests failed labels

nrg4878 requested changes

View reviewed changes

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java Outdated

+                  // parameters being set in a executed statement as a 2 byte signed integer to the servers. This indirectly
+                  // sets s limit on the number of parameters that can be set to 32767. This is also the number of parameters
+                  // that can be passed to the JDO executeArray call.
+                  final int JDBC_STMT_PARAM_LIMIT = 32767;

Contributor

nrg4878 Nov 9, 2020

2 thoughts on this

Can we move this outside this method to the top of the class?
I understand that based on the research all DBs support atleast 32767 parameters. But do we really know that it works? We dont have a test case for this that tests across all databases nor have we tested this, afaik. Also we are starting to support a pluggable model for DB support. So users can use HMS with other DBs as well for example DB2 (all they have to do is provide the schema scripts). So we don't know what their limits would be.
Would it make sense to use a smaller number? what is the downside of doing so?

Contributor Author

vnhive Nov 11, 2020

I am OK with this. Will do this and resubmit the patch.

But do we really know that it works? We dont have a test case for this that tests across all databases nor have we
tested this, afaik.

Agreed. I will look into creating a test. Insofar as the issue itself is concerned they seem to be documented limitations of the JDBC driver and the solution is to break the original query into multiple queries. Please do take a look at - https://www.postgresql.org/message-id/16832734.post%40talk.nabble.com

We dont have a test case for this that tests across all databases nor have we tested this, afaik.

Agreed. I can work on this.

Also we are starting to support a pluggable model for DB support. So users can use HMS with other DBs as well for
example DB2 (all they have to do is provide the schema scripts). So we don't know what their limits would be.
Would it make sense to use a smaller number? what is the downside of doing so?

Thank you for explaining about the pluggable model for DB support. We can consider a smaller size for the parameters that are passed to the query.

The downside of reducing the number of parameters that are passed would be that it can potentially result in increases the number of iterations required to get the result of the query for the same number of parameters.

for example if there are 30000 parameters, with a size of 32767, it can be executed in a single query. But if we set the size to 1000, it will result in 30000/1000 = 30 iterations.

This will result in increased latency.

It is important to note that this part of the code is executed for all DDL, DML and queries that involved partitioned tables. Hence it has potential to increase the latency across the spectrum.

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java

-                  for (int i = 0; i < paramsForFilter.size(); ++i) {
-                    params[i + 3] = paramsForFilter.get(i);
-                  }
+                          "select " + PARTITIONS + ".\"PART_ID\" from " + PARTITIONS + ""

Contributor

nrg4878 Nov 9, 2020

re-indent this code to original spacing

Contributor Author

vnhive Nov 11, 2020

Will do ! Will make the change and resubmit.

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java

+                  // If this method has been called by default we need to query the tables once. Hence set the minimum numbner of
+                  // iterations to 1.
+                  int iterations = 1;

Contributor

nrg4878 Nov 9, 2020

remove extra blank lines. max of 1 blank line between code blocks.

Contributor Author

vnhive Nov 11, 2020

Will do ! Will make the change and resubmit.

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java

+                    // Find the number of iterations required from the number of input parameters and sets of 32764 parameters
+                    // that can be created from it.
+                    iterations = parametersToProcess / (JDBC_STMT_PARAM_LIMIT - 3);
+                    iterations += ((parametersToProcess % (JDBC_STMT_PARAM_LIMIT - 3) != 0) ? 1 : 0);

Contributor

nrg4878 Nov 9, 2020

we should be able to merge these 2 statements. iterations is initialized to 1. Within this conditional block, if we just added the result of integer division to the value of iterations, should be good right? This code is only executed when the params is greater than the limit.

Contributor Author

vnhive Nov 11, 2020

I used his code to distinguish the case when the number of parameters is a multiple of JDBC_STMT_PARAM_LIMIT - 3, vs when it is not.

For example,

when parametersToProcess = 32764 * 2 => iterations = 2

However

when parametersToProcess = 32764 * 2 + 11 => iterations = 3

32764 parameters in the first iteration
32764 parameters in the second iteration
11 parameters in the third iteration

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java Outdated

+                      }
+                      result.add(MetastoreDirectSqlUtils.extractSqlLong(fields));
+                    }

Contributor

nrg4878 Nov 9, 2020

remove blank line

Contributor Author

vnhive Nov 11, 2020

Will do ! Will make the change and resubmit.

...tore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java

+                          break;
+                        }
+                      }
+                      result.add(MetastoreDirectSqlUtils.extractSqlLong(fields));

Contributor

nrg4878 Nov 9, 2020

I dont think so but I am asking it anyways. Is it possible to have duplicates across sub-queries?

Contributor Author

vnhive Nov 11, 2020

This is the query that is executed,

select "PARTITIONS"."PART_ID" from "PARTITIONS" inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0 where "DBS"."CTLG_NAME" = ? and ( (("FILTER0"."PART_KEY_VAL" = ?) or (("FILTER0"."PART_KEY_VAL" = ?) or (("FILTER0"."PART_KEY_VAL" = ?) or ("FILTER0"."PART_KEY_VAL" = ?)) ) ) )

We pass in unique values to PART_KEY_VAL, hence this ideally should not result in duplicates.

vnhive force-pushed the HIVE-24343 branch from ddfcb33 to b4a88d1 Compare

November 11, 2020 12:13

kgyrtkirk added tests pending tests passed and removed tests passed tests pending labels

vnhive added 2 commits

November 26, 2020 13:52


          HIVE-24343: Table partition operations (create, drop, select) fail wh…

e9c7b28

…en the number of partitions is greater than 32767 (signed int)


          HIVE-24343: Table partition operations (create, drop, select) fail wh…

6a0c039

…en the number of partitions is greater than 32767 (signed int)

Addressed comments

vnhive force-pushed the HIVE-24343 branch from b4a88d1 to 6a0c039 Compare

November 26, 2020 08:30

kgyrtkirk added tests pending tests passed and removed tests passed tests pending labels

github-actions bot commented Jan 26, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

github-actions bot added the stale label

github-actions bot closed this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale tests passed