HIVE-26144: Add keys/indexes to support highly concurrent workload #3214

Closed
wants to merge 3 commits

Conversation

kovjanos
Contributor

What changes were proposed in this pull request?

Missing keys/indexes are to be added to the HMS backend DB schema.

Why are the changes needed?

In a high-concurrency test we found that the backend database performs full table scans in some cases because the affected table is missing a key/index.
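
For illustration, one of the additions in this patch is a surrogate identity key on COMPLETED_TXN_COMPONENTS (see the Derby diff hunk reviewed further down in this conversation). An upgrade-script form of that change might look like the sketch below; the ALTER statement is illustrative Derby syntax, not copied verbatim from the patch.

-- Hedged sketch only (Derby syntax): a surrogate identity primary key lets
-- cleanup statements address individual rows instead of relying on full table scans.
ALTER TABLE COMPLETED_TXN_COMPONENTS
  ADD COLUMN CTC_ID bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY;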

Does this PR introduce any user-facing change?

No

How was this patch tested?

Integration tests for all database types:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestPostgres
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.949 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestPostgres
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestMssql
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 33.711 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestMssql
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestMysql
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.032 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestMysql
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestDerby
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.101 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestDerby
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestOracle
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 94.393 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestOracle
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0

Contributor

@zabetak left a comment

Thanks for the PR @kovjanos. I left some general questions under the JIRA.

Also, I am wondering if you tested the upgrade scripts with tables having data. What happens to existing rows when you alter the table to introduce the new PK column? Are they populated automatically?

@kovjanos
Contributor Author

> Also, I am wondering if you tested the upgrade scripts with tables having data. What happens to existing rows when you alter the table to introduce the new PK column? Are they populated automatically?

The MySQL part (on MariaDB) was tested in a production environment. Let me add the other tests and results here...

@kovjanos
Contributor Author

@zabetak Tested with Derby, MySQL, PostgreSQL and MSSQL and it worked - see below.
Oracle doesn't work yet; I need to review it again.

Versions tested:

az82/docker-derby
mysql:5.7
postgres:11.6
mcr.microsoft.com/mssql/server:2019-latest

Derby:

CREATE TABLE TEST_TABLE (
  ID  bigint NOT NULL,
  DB  varchar(10) NOT NULL
);
CREATE INDEX TEST_TABLE_IDX ON TEST_TABLE (DB);
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE TEST_TABLE ADD COLUMN PKEY bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY;
SELECT * FROM TEST_TABLE;
INSERT INTO TEST_TABLE (ID,DB) VALUES (4,'4-db'),(5,'5-db'),(6,'6-db');
SELECT * FROM TEST_TABLE;

PostgreSQL:

CREATE TABLE "TEST_TABLE" (
  "ID"  bigint NOT NULL,
  "DB"  varchar(10) NOT NULL);
CREATE INDEX TEST_TABLE_IDX ON "TEST_TABLE" USING btree ("DB");
INSERT INTO "TEST_TABLE" VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE "TEST_TABLE" ADD "PKEY" bigserial PRIMARY KEY;
SELECT * FROM "TEST_TABLE";
INSERT INTO "TEST_TABLE" VALUES (4,'4-db'),(5,'5-db'),(6,'6-db');
SELECT * FROM "TEST_TABLE";

MySQL:

CREATE TABLE TEST_TABLE (
  ID  bigint NOT NULL,
  DB  varchar(10) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE INDEX TEST_TABLE_IDX ON TEST_TABLE (DB) USING BTREE;
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE TEST_TABLE ADD COLUMN PKEY BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT FIRST;
SELECT * FROM TEST_TABLE;
INSERT INTO TEST_TABLE (ID,DB) VALUES (4,'4-db'),(5,'5-db'),(6,'6-db');
SELECT * FROM TEST_TABLE;

MSSQL:

CREATE TABLE TEST_TABLE (
  ID  bigint NOT NULL,
  DB  varchar(10) NOT NULL
);
CREATE INDEX TEST_TABLE_IDX ON TEST_TABLE (DB);
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE TEST_TABLE ADD PKEY bigint NOT NULL IDENTITY(1,1) PRIMARY KEY;
SELECT * FROM TEST_TABLE;
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
SELECT * FROM TEST_TABLE;

@kovjanos
Contributor Author

That doesn't work with Oracle: even on a plain table the identity column has uniqueness issues with multi-row inserts - see below. Do we need to have this also on Oracle? I can add the column to keep the schema consistent, but it can't be a PRIMARY KEY (it might not even be required for the DELETEs in Oracle if they work differently there).
One option is the old-school solution: a sequence and a trigger behind the increment column (a sketch follows the output below). I'll give it a try..

CREATE TABLE TEST_TABLE (
  ID  number(19) NOT NULL,
  DB  varchar(10) NOT NULL,
  PKEY NUMBER(19) GENERATED ALWAYS AS IDENTITY 
) ROWDEPENDENCIES;
INSERT ALL
  INTO TEST_TABLE (ID, DB) VALUES (1,'1-db')
  INTO TEST_TABLE (ID, DB) VALUES (2,'2-db')
  INTO TEST_TABLE (ID, DB) VALUES (3,'3-db')
SELECT 1 FROM DUAL;
SELECT * FROM TEST_TABLE;

  ID DB         PKEY
---- ---------- ----
   1 1-db          1
   2 2-db          1
   3 3-db          1
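
A hedged, untested sketch of the sequence-and-trigger approach mentioned above (object names are illustrative):

-- Plain column instead of an identity column; the trigger below fills it in.
CREATE TABLE TEST_TABLE (
  ID   number(19) NOT NULL,
  DB   varchar(10) NOT NULL,
  PKEY number(19)
);
-- Sequence supplying the surrogate key values.
CREATE SEQUENCE TEST_TABLE_SEQ START WITH 1 INCREMENT BY 1 NOCACHE;
-- Row-level trigger: NEXTVAL is evaluated once per inserted row,
-- so a multi-row INSERT ALL should no longer collapse to a single value.
CREATE OR REPLACE TRIGGER TEST_TABLE_PKEY_TRG
BEFORE INSERT ON TEST_TABLE
FOR EACH ROW
WHEN (NEW.PKEY IS NULL)
BEGIN
  :NEW.PKEY := TEST_TABLE_SEQ.NEXTVAL;
END;
/
INSERT ALL
  INTO TEST_TABLE (ID, DB) VALUES (1,'1-db')
  INTO TEST_TABLE (ID, DB) VALUES (2,'2-db')
  INTO TEST_TABLE (ID, DB) VALUES (3,'3-db')
SELECT 1 FROM DUAL;
SELECT * FROM TEST_TABLE;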

@@ -570,7 +571,8 @@ CREATE TABLE COMPLETED_TXN_COMPONENTS (
 CTC_PARTITION varchar(767),
 CTC_TIMESTAMP timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
 CTC_WRITEID bigint,
-CTC_UPDATE_DELETE char(1) NOT NULL
+CTC_UPDATE_DELETE char(1) NOT NULL,
+CTC_ID bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
Member

same as above

Contributor Author

Thanks for the review @deniskuzZ! Based on the above tests, this is only a problem for Oracle; the others - Derby, PgSQL, MySQL, MSSQL - all generate unique values for multi-row inserts. As soon as I'm free from the other ticket I'll test the Oracle case to see whether the cleaner queries get better plans with the identity column, and thus whether a sequence-based column is needed or a plain identity column is enough to keep the schema consistent across all engines.
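
As a purely hypothetical illustration (not taken from the Hive codebase) of the kind of cleaner-style statement whose plans would be compared, using only columns visible in the diff above:

-- Illustrative predicates only. Without a supporting index, a form like this may full-scan:
DELETE FROM COMPLETED_TXN_COMPONENTS WHERE CTC_WRITEID <= ? AND CTC_PARTITION = ?;
-- With the surrogate identity key, individual rows can be targeted directly:
DELETE FROM COMPLETED_TXN_COMPONENTS WHERE CTC_ID = ?;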

Member

Oh, sorry, that's not a TXN_ID column; it might be worth checking HIVE-23048: Use sequences for TXN_ID generation.
What are the problematic queries that could benefit from a TC_ID/CTC_ID PK?

github-actions bot commented Aug 7, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

github-actions bot added the stale label Aug 7, 2022
github-actions bot closed this Aug 15, 2022