HIVE-26144: Add keys/indexes to support highly concurrent workload #3214

Closed
wants to merge 3 commits

Conversation

kovjanos
Contributor

What changes were proposed in this pull request?

Missing keys/indexes are to be added to the HMS backend DB schema.

Why are the changes needed?

In a high-concurrency test we found that the backend database performs full table scans in some cases because the affected table is missing a key/index.
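
For illustration, one of the additions in this patch is a surrogate identity key on COMPLETED_TXN_COMPONENTS (see the Derby diff hunk reviewed further down in this conversation). An upgrade-script form of that change might look like the sketch below; the ALTER statement is illustrative Derby syntax, not copied verbatim from the patch.

-- Hedged sketch only (Derby syntax): a surrogate identity primary key lets
-- cleanup statements address individual rows instead of relying on full table scans.
ALTER TABLE COMPLETED_TXN_COMPONENTS
  ADD COLUMN CTC_ID bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY;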

Does this PR introduce any user-facing change?

No

How was this patch tested?

Integration tests for all database types:

[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestPostgres
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.949 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestPostgres
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestMssql
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 33.711 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestMssql
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestMysql
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.032 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestMysql
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestDerby
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.101 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestDerby
[INFO] Running org.apache.hadoop.hive.metastore.dbinstall.ITestOracle
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 94.393 s - in org.apache.hadoop.hive.metastore.dbinstall.ITestOracle
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0

Contributor

@zabetak left a comment

Thanks for the PR @kovjanos. I left some general questions under the JIRA.

Also, I am wondering if you tested the upgrade scripts with tables having data. What happens to existing rows when you alter the table to introduce the new PK column? Are they populated automatically?

@kovjanos
Contributor Author

> Also, I am wondering if you tested the upgrade scripts with tables having data. What happens to existing rows when you alter the table to introduce the new PK column? Are they populated automatically?

The MySQL part (on MariaDB) was tested in a production environment. Let me add the other tests and results here...

@kovjanos
Contributor Author

@zabetak Tested with Derby, MySQL, PostgreSQL and MSSQL and it worked - see below.
Oracle doesn't work yet; I need to review it again.

Versions tested:

az82/docker-derby
mysql:5.7
postgres:11.6
mcr.microsoft.com/mssql/server:2019-latest

Derby:

CREATE TABLE TEST_TABLE (
  ID  bigint NOT NULL,
  DB  varchar(10) NOT NULL
);
CREATE INDEX TEST_TABLE_IDX ON TEST_TABLE (DB);
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE TEST_TABLE ADD COLUMN PKEY bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY;
SELECT * FROM TEST_TABLE;
INSERT INTO TEST_TABLE (ID,DB) VALUES (4,'4-db'),(5,'5-db'),(6,'6-db');
SELECT * FROM TEST_TABLE;

PostgreSQL:

CREATE TABLE "TEST_TABLE" (
  "ID"  bigint NOT NULL,
  "DB"  varchar(10) NOT NULL);
CREATE INDEX TEST_TABLE_IDX ON "TEST_TABLE" USING btree ("DB");
INSERT INTO "TEST_TABLE" VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE "TEST_TABLE" ADD "PKEY" bigserial PRIMARY KEY;
SELECT * FROM "TEST_TABLE";
INSERT INTO "TEST_TABLE" VALUES (4,'4-db'),(5,'5-db'),(6,'6-db');
SELECT * FROM "TEST_TABLE";

MySQL:

CREATE TABLE TEST_TABLE (
  ID  bigint NOT NULL,
  DB  varchar(10) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE INDEX TEST_TABLE_IDX ON TEST_TABLE (DB) USING BTREE;
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE TEST_TABLE ADD COLUMN PKEY BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT FIRST;
SELECT * FROM TEST_TABLE;
INSERT INTO TEST_TABLE (ID,DB) VALUES (4,'4-db'),(5,'5-db'),(6,'6-db');
SELECT * FROM TEST_TABLE;

MSSQL:

CREATE TABLE TEST_TABLE (
  ID  bigint NOT NULL,
  DB  varchar(10) NOT NULL
);
CREATE INDEX TEST_TABLE_IDX ON TEST_TABLE (DB);
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
ALTER TABLE TEST_TABLE ADD PKEY bigint NOT NULL IDENTITY(1,1) PRIMARY KEY;
SELECT * FROM TEST_TABLE;
INSERT INTO TEST_TABLE VALUES (1,'1-db'),(2,'2-db'),(3,'3-db');
SELECT * FROM TEST_TABLE;

@kovjanos
Contributor Author

That doesn't work with Oracle: even on a plain table the identity column has uniqueness issues with multi-row inserts - see below. Do we need to have this also on Oracle? I can add the column to keep the schema consistent, but it can't be a PRIMARY KEY (it might not even be required for the DELETEs in Oracle if they work differently there).
One option is the old-school solution: a sequence and a trigger behind the increment column (a sketch follows the output below). I'll give it a try..

CREATE TABLE TEST_TABLE (
  ID  number(19) NOT NULL,
  DB  varchar(10) NOT NULL,
  PKEY NUMBER(19) GENERATED ALWAYS AS IDENTITY 
) ROWDEPENDENCIES;
INSERT ALL
  INTO TEST_TABLE (ID, DB) VALUES (1,'1-db')
  INTO TEST_TABLE (ID, DB) VALUES (2,'2-db')
  INTO TEST_TABLE (ID, DB) VALUES (3,'3-db')
SELECT 1 FROM DUAL;
SELECT * FROM TEST_TABLE;

  ID DB         PKEY
---- ---------- ----
   1 1-db          1
   2 2-db          1
   3 3-db          1
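
A hedged, untested sketch of the sequence-and-trigger approach mentioned above (object names are illustrative):

-- Plain column instead of an identity column; the trigger below fills it in.
CREATE TABLE TEST_TABLE (
  ID   number(19) NOT NULL,
  DB   varchar(10) NOT NULL,
  PKEY number(19)
);
-- Sequence supplying the surrogate key values.
CREATE SEQUENCE TEST_TABLE_SEQ START WITH 1 INCREMENT BY 1 NOCACHE;
-- Row-level trigger: NEXTVAL is evaluated once per inserted row,
-- so a multi-row INSERT ALL should no longer collapse to a single value.
CREATE OR REPLACE TRIGGER TEST_TABLE_PKEY_TRG
BEFORE INSERT ON TEST_TABLE
FOR EACH ROW
WHEN (NEW.PKEY IS NULL)
BEGIN
  :NEW.PKEY := TEST_TABLE_SEQ.NEXTVAL;
END;
/
INSERT ALL
  INTO TEST_TABLE (ID, DB) VALUES (1,'1-db')
  INTO TEST_TABLE (ID, DB) VALUES (2,'2-db')
  INTO TEST_TABLE (ID, DB) VALUES (3,'3-db')
SELECT 1 FROM DUAL;
SELECT * FROM TEST_TABLE;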

@@ -570,7 +571,8 @@ CREATE TABLE COMPLETED_TXN_COMPONENTS (
 CTC_PARTITION varchar(767),
 CTC_TIMESTAMP timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
 CTC_WRITEID bigint,
-CTC_UPDATE_DELETE char(1) NOT NULL
+CTC_UPDATE_DELETE char(1) NOT NULL,
+CTC_ID bigint PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
Member

same as above

Contributor Author

Thanks for the review @deniskuzZ! Based on the above tests, this is only a problem for Oracle; the others - Derby, PgSQL, MySQL, MSSQL - all generate unique values for multi-row inserts. As soon as I'm free from the other ticket I'll test the Oracle case to see whether the cleaner queries get better plans with the identity column, and thus whether a sequence-based column is needed or a plain identity column is enough to keep the schema consistent across all engines.
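
As a purely hypothetical illustration (not taken from the Hive codebase) of the kind of cleaner-style statement whose plans would be compared, using only columns visible in the diff above:

-- Illustrative predicates only. Without a supporting index, a form like this may full-scan:
DELETE FROM COMPLETED_TXN_COMPONENTS WHERE CTC_WRITEID <= ? AND CTC_PARTITION = ?;
-- With the surrogate identity key, individual rows can be targeted directly:
DELETE FROM COMPLETED_TXN_COMPONENTS WHERE CTC_ID = ?;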

Member

Oh, sorry, that's not a TXN_ID column; it might be worth checking HIVE-23048: Use sequences for TXN_ID generation.
What are the problematic queries that could benefit from a TC_ID/CTC_ID PK?

github-actions bot commented Aug 7, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

github-actions bot added the stale label Aug 7, 2022
github-actions bot closed this Aug 15, 2022