Skip to content

HIVE-24343: Table partition operations (create, drop, select) fail wh…#1640

Closed
vnhive wants to merge 2 commits intoapache:masterfrom
vnhive:HIVE-24343
Closed

HIVE-24343: Table partition operations (create, drop, select) fail wh…#1640
vnhive wants to merge 2 commits intoapache:masterfrom
vnhive:HIVE-24343

Conversation

@vnhive
Copy link
Contributor

@vnhive vnhive commented Nov 2, 2020

…en the number of partitions is greater than 32767 (signed int)

What changes were proposed in this pull request?

The implementation of getPartitionIdsViaSqlFilter should be extended to account for the limit of 32767
on the number of parameters to the underlying JDBC driver.

Why are the changes needed?

The table partition operations - create, drop, select access the underlying relation database using JDO,
which internally routes the operations through the JDBC driver. Most of the underlying JDBC driver
implementations place a limit on the number of parameters that can be passed through a statement
implementation. These limits should be taken into account when querying the underlying metastore.

Does this PR introduce any user-facing change?

The patch will ensure that if more than 32767 operations are being operated on (added, dropped or queried)
at the same time, it will not fail.

How was this patch tested?

  1. The patch was verified by attaching the debugger and verifying the code flow
  2. The existing metastore tests passed.

// parameters being set in a executed statement as a 2 byte signed integer to the servers. This indirectly
// sets s limit on the number of parameters that can be set to 32767. This is also the number of parameters
// that can be passed to the JDO executeArray call.
final int JDBC_STMT_PARAM_LIMIT = 32767;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 thoughts on this

  1. Can we move this outside this method to the top of the class?
  2. I understand that based on the research all DBs support atleast 32767 parameters. But do we really know that it works? We dont have a test case for this that tests across all databases nor have we tested this, afaik. Also we are starting to support a pluggable model for DB support. So users can use HMS with other DBs as well for example DB2 (all they have to do is provide the schema scripts). So we don't know what their limits would be.
    Would it make sense to use a smaller number? what is the downside of doing so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I am OK with this. Will do this and resubmit the patch.
  1. But do we really know that it works? We dont have a test case for this that tests across all databases nor have we
    tested this, afaik.

Agreed. I will look into creating a test. Insofar as the issue itself is concerned they seem to be documented limitations of the JDBC driver and the solution is to break the original query into multiple queries. Please do take a look at - https://www.postgresql.org/message-id/16832734.post%40talk.nabble.com

We dont have a test case for this that tests across all databases nor have we tested this, afaik.

Agreed. I can work on this.

Also we are starting to support a pluggable model for DB support. So users can use HMS with other DBs as well for
example DB2 (all they have to do is provide the schema scripts). So we don't know what their limits would be.
Would it make sense to use a smaller number? what is the downside of doing so?

Thank you for explaining about the pluggable model for DB support. We can consider a smaller size for the parameters that are passed to the query.

The downside of reducing the number of parameters that are passed would be that it can potentially result in increases the number of iterations required to get the result of the query for the same number of parameters.

for example if there are 30000 parameters, with a size of 32767, it can be executed in a single query. But if we set the size to 1000, it will result in 30000/1000 = 30 iterations.

This will result in increased latency.

It is important to note that this part of the code is executed for all DDL, DML and queries that involved partitioned tables. Hence it has potential to increase the latency across the spectrum.

for (int i = 0; i < paramsForFilter.size(); ++i) {
params[i + 3] = paramsForFilter.get(i);
}
"select " + PARTITIONS + ".\"PART_ID\" from " + PARTITIONS + ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-indent this code to original spacing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do ! Will make the change and resubmit.

// If this method has been called by default we need to query the tables once. Hence set the minimum numbner of
// iterations to 1.
int iterations = 1;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra blank lines. max of 1 blank line between code blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do ! Will make the change and resubmit.

// Find the number of iterations required from the number of input parameters and sets of 32764 parameters
// that can be created from it.
iterations = parametersToProcess / (JDBC_STMT_PARAM_LIMIT - 3);
iterations += ((parametersToProcess % (JDBC_STMT_PARAM_LIMIT - 3) != 0) ? 1 : 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should be able to merge these 2 statements. iterations is initialized to 1. Within this conditional block, if we just added the result of integer division to the value of iterations, should be good right? This code is only executed when the params is greater than the limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used his code to distinguish the case when the number of parameters is a multiple of JDBC_STMT_PARAM_LIMIT - 3, vs when it is not.

For example,

when parametersToProcess = 32764 * 2 => iterations = 2

However

when parametersToProcess = 32764 * 2 + 11 => iterations = 3

32764 parameters in the first iteration
32764 parameters in the second iteration
11 parameters in the third iteration

}
result.add(MetastoreDirectSqlUtils.extractSqlLong(fields));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove blank line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do ! Will make the change and resubmit.

break;
}
}
result.add(MetastoreDirectSqlUtils.extractSqlLong(fields));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think so but I am asking it anyways. Is it possible to have duplicates across sub-queries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the query that is executed,

select "PARTITIONS"."PART_ID" from "PARTITIONS" inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = ? inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = ? inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0 where "DBS"."CTLG_NAME" = ? and ( (("FILTER0"."PART_KEY_VAL" = ?) or (("FILTER0"."PART_KEY_VAL" = ?) or (("FILTER0"."PART_KEY_VAL" = ?) or ("FILTER0"."PART_KEY_VAL" = ?)) ) ) )

We pass in unique values to PART_KEY_VAL, hence this ideally should not result in duplicates.

…en the number of partitions is greater than 32767 (signed int)
…en the number of partitions is greater than 32767 (signed int)

Addressed comments
@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

@github-actions github-actions bot added the stale label Jan 26, 2021
@github-actions github-actions bot closed this Feb 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants