PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes #1766

haridsv · 2023-12-14T13:56:15Z

This PR replaces #1738 to include some experimental scanner changes for PHOENIX-7015.

This has the changes needed for PHOENIX-7013 and PHOENIX-7014. It short-circuits the existing optimizer logic to produce a plan that is suitable for CDC queries. It also reimplements query hint parsing using regex so that it is simpler and easier to maintain. It includes a PoC query test and some debug test cases for short-term reference (these can be removed before merging to master).
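To illustrate the regex-based hint parsing mentioned above, here is a minimal sketch. It is not the code from this PR: the class name `HintSketch`, the pattern, and the index name `tchanges_idx` are assumptions made purely for illustration, and it only handles hints of the form `NAME` or `NAME(args)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HintSketch {
    // Hypothetical pattern: a hint name, optionally followed by a parenthesized argument list.
    private static final Pattern HINT = Pattern.compile("(\\w+)\\s*(?:\\(([^)]*)\\))?");

    /**
     * Parses the text found between the hint comment delimiters into a map of
     * upper-cased hint name to raw argument string ("" when the hint has no arguments).
     */
    public static Map<String, String> parse(String hintBody) {
        Map<String, String> hints = new LinkedHashMap<>();
        Matcher m = HINT.matcher(hintBody);
        while (m.find()) {
            String args = m.group(2) == null ? "" : m.group(2).trim();
            hints.put(m.group(1).toUpperCase(), args);
        }
        return hints;
    }

    public static void main(String[] args) {
        // prints {INDEX=t tchanges_idx, NO_CACHE=}
        System.out.println(parse("INDEX(t tchanges_idx) NO_CACHE"));
    }
}
```

A single pattern keeps the grammar for all hints in one place, which is roughly the maintainability argument made for the regex approach.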

This PR also goes one step further and generates a mock JSON string (PoC in #1738) so that results can be produced from the actual data (though not yet in the expected format). I temporarily added Gson as a dependency to generate the result, but this will be replaced with BSON once JSON support is available. Here is a quick demo of how this works:

0: jdbc:phoenix:localhost> create table t (k INTEGER PRIMARY KEY, v1 INTEGER);
No rows affected (0.651 seconds)
0: jdbc:phoenix:localhost> upsert into t (k, v1) VALUES (1, 100);
1 row affected (0.067 seconds)
0: jdbc:phoenix:localhost> upsert into t (k, v1) VALUES (2, 200);
1 row affected (0.067 seconds)
0: jdbc:phoenix:localhost> upsert into t (k, v1) VALUES (1, 101);
1 row affected (0.067 seconds)
0: jdbc:phoenix:localhost> create cdc tchanges  on t(PHOENIX_ROW_TIMESTAMP());
2 rows affected (5.686 seconds)
0: jdbc:phoenix:localhost> select * from tchanges;
+--------------------------+---+------------+
|  PHOENIX_ROW_TIMESTAMP() | K |  CDC JSON  |
+--------------------------+---+------------+
| 2023-12-06               | 1 | {"V1":100} |
| 2023-12-06               | 2 | {"V1":200} |
| 2023-12-06               | 1 | {"V1":101} |
+--------------------------+---+------------+
3 rows selected (0.022 seconds)
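As a rough sketch of how a `CDC JSON` value like the ones in the demo could be assembled from a row's non-key columns: the PR itself uses Gson for this, whereas the snippet below sticks to the standard library so it runs on its own. The class name `CdcJsonSketch` and the method `toJson` are hypothetical, and the escaping covers only quotes and backslashes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CdcJsonSketch {
    /**
     * Renders column name -> value pairs as a flat JSON object.
     * Numbers, booleans and null are written bare; everything else is quoted.
     */
    public static String toJson(Map<String, Object> columns) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : columns.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null || v instanceof Number || v instanceof Boolean) {
                sb.append(v);
            } else {
                sb.append('"').append(escape(v.toString())).append('"');
            }
        }
        return sb.append('}').toString();
    }

    // Minimal escaping only (backslash and double quote); control characters are not handled.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("V1", 100);
        System.out.println(toJson(row)); // prints {"V1":100}
    }
}
```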

tkhurana · 2023-12-19T18:19:16Z

phoenix-core/src/main/java/org/apache/phoenix/coprocessor/CDCGlobalIndexRegionScanner.java

+    private Map<ImmutableBytesPtr, String> dataColQualNameMap;
+    private Map<ImmutableBytesPtr, PDataType> dataColQualTypeMap;
+    // Map<dataRowKey: Map<TS: Map<qualifier: Cell>>>
+    private Map<ImmutableBytesPtr, Map<Long, Map<ImmutableBytesPtr, Cell>>> dataRowChanges =


This is pretty complicated. Maybe you could introduce an abstraction that represents a change as an object. Then you could have a collection of changes.

This was meant to be a quick PoC sort of code and should go away as @TheNamesRai is already reworking the logic to build the row without having to use this intermediate data structure.
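For reference, the abstraction suggested in this thread could look roughly like the sketch below. It is only an illustration of the idea and not code from this PR or from the rework mentioned above: `RowChange`, `ChangeLog` and their members are hypothetical names, and plain `String`/`Object` stand in for `ImmutableBytesPtr` and `Cell` so that the example is self-contained.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowChangeSketch {
    /** One change: the cells written to a single data row at a single timestamp. */
    public static final class RowChange {
        public final String rowKey;
        public final long timestamp;
        public final Map<String, Object> cellsByQualifier = new LinkedHashMap<>();

        RowChange(String rowKey, long timestamp) {
            this.rowKey = rowKey;
            this.timestamp = timestamp;
        }
    }

    /** A collection of changes, replacing Map<rowKey, Map<ts, Map<qualifier, cell>>>. */
    public static final class ChangeLog {
        // Keyed by (rowKey, timestamp) so cells of the same change are grouped together.
        private final Map<SimpleImmutableEntry<String, Long>, RowChange> changes = new LinkedHashMap<>();

        public void record(String rowKey, long ts, String qualifier, Object value) {
            changes.computeIfAbsent(new SimpleImmutableEntry<>(rowKey, ts),
                    k -> new RowChange(rowKey, ts))
                   .cellsByQualifier.put(qualifier, value);
        }

        /** Returns the changes ordered by timestamp; ties keep insertion order (stable sort). */
        public List<RowChange> inTimestampOrder() {
            List<RowChange> out = new ArrayList<>(changes.values());
            out.sort((a, b) -> Long.compare(a.timestamp, b.timestamp));
            return out;
        }
    }

    public static void main(String[] args) {
        ChangeLog log = new ChangeLog();
        log.record("1", 30L, "V1", 101);
        log.record("1", 10L, "V1", 100);
        log.record("2", 20L, "V1", 200);
        for (RowChange c : log.inTimestampOrder()) {
            System.out.println(c.timestamp + " " + c.rowKey + " " + c.cellsByQualifier);
        }
    }
}
```

The point of the design is that callers iterate over `RowChange` objects instead of unpacking three levels of maps.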

tkhurana · 2023-12-20T19:59:46Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCMiscIT.java

+        conn.createStatement().execute("UPSERT INTO " + tableName + " (k, v1) VALUES (1, 101)");
+        conn.commit();
+        String cdcName = generateUniqueName();
+        String cdc_sql = "CREATE CDC " + cdcName


Every time we run CREATE CDC, do we need to pass PHOENIX_ROW_TIMESTAMP? If it is mandatory, we shouldn't make it part of the interface; the system should do it implicitly under the hood.

Also it seems the uncovered index that is being created as part of CREATE CDC is being created synchronously. That will only work for small tables. For bigger tables the index needs to be created async and then built explicitly using IndexTool.

Every time we run CREATE CDC, do we need to pass PHOENIX_ROW_TIMESTAMP? If it is mandatory, we shouldn't make it part of the interface; the system should do it implicitly under the hood.

The feature allows any timestamp-like column to be used instead of PHOENIX_ROW_TIMESTAMP, which is why this flexibility is allowed. However, I am not 100% sure about the use cases. I will have an offline discussion on this, thank you!

For bigger tables the index needs to be created async and then built explicitly using IndexTool.

If we do this, won't we also need to project the index status via CDC object, since we intend to keep the index hidden as an implementation detail? Do you think we should have an offline mode for CREATE CDC itself?

If we do this, won't we also need to project the index status via CDC object

I think that can be a later improvement, but supporting async index creation is necessary because creating a sync index for a large table will be too expensive or not possible anyway.

If we do this, won't we also need to project the index status via CDC object

@haridsv I didn't quite understand what you mean by index status here?

I didn't quite understand what you mean by index status here?

I was referring to the async index creation lifecycle, which should be attributed to the CDC object so that the user can query it and know when it is done.

@tkhurana In any case, I am not planning to address these as part of this PR, so are you OK to merge this as is?

OK, we can tackle that in a separate PR.

@tkhurana Async index creation is addressed as part of #1802.

Also removed the need to explicitly specify PHOENIX_ROW_TIMESTAMP(), since we dropped support for a user-specified timestamp column.

virajjasani · 2023-12-21T18:14:06Z

@tkhurana if you are good with the current state with any additional comments to be taken care of in subsequent PRs against feature branch (or final patch against master branch), we can merge this.

virajjasani · 2023-12-23T16:28:54Z

With follow-ups to come in separate PRs, merging this one.

haridsv · 2024-01-18T05:26:26Z

PR #1794 is the follow-up that implements the query functionality to the spec.

Address the changes in the CREATE CDC spec in PHOENIX-7001

Also includes some review feedback changes across the changes in PRs apache#1681, apache#1703 and apache#1766

This commit includes a squash of all the below changes from the PRs apache#1662, apache#1694, down by removing non-functional changes for the ease of review. On top of it, the changes have been reworked for the changed state of master, especially the new submodules.

* 4c9827a Hari.. PHOENIX-7008 Shallow grammar support for CREATE CDC (apache#1662)
* e5220e0 TheN.. PHOENIX-7054 Shallow grammar support for DROP CDC and ALTER CDC (apache#1694)
* 581e613 Hari.. PHOENIX-7008 Implementation for CREATE CDC (apache#1681)
* e2ef886 Hari.. PHOENIX-7008 Tweaks, fixes and additional test coverage for CREATE CDC (apache#1703)
* 5d3fd40 TheN.. PHOENIX-7074 DROP CDC Implementation (apache#1713)
* 7420443 Hari.. PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes (apache#1766)
* da6ddad Kadi.. Add an extra delete mutation for CDC
* 93d586e Kadi.. Add an extra delete mutation during rebuild for CDC index
* f07898f Hari.. PHOENIX-7008: Addressing Jira spec and review feedback changes (apache#1802)
* 87a2ea1 Hari.. PHOENIX-7008: Fix for parser gap and fix for failing test (apache#1812)
* e395780 TheN.. PHOENIX-7015 Implementing CDCGlobalIndexRegionScanner (apache#1813)

Co-authored-by: Saurabh Rai <saurabh.rai@salesforce.com>

haridsv added 25 commits November 27, 2023 21:35

Simplified hint parsing logic

c57f9b4

Set parent table info for CDC PTable

74c0c9e

Experimental client side compiler and optimizer changes with a lot of debug code and tests

9bab359

Update PHOENIX_ROW_TIMESTAMP to DATE type to be consistent with that of the function definition

884db9e

Remove debug code

79b5090

Add a new assert for unimplemented functionality and also implement a TODO

5291bbf

Fixex for failing hint test scenarios

a9e5e33

Go through the uncovered index via hint

7ab3b95

A small deduping attempt

fce7ea6

Fix typo

66f99d2

Add test coverage to catch my earlier typo in BaseQueryPlan

c0f2910

Update test case after previous typo fix

6ec934f

Parking ad-hoc changes to get projection type with correct expression type of ProjectedColumnExpression instead of KeyValueColumnExpression

1e4ade2

Removed previous CDC specific optimizer code and bypassed optimizer completely. Make index as the physical table for CDC

291049e

revert whitespace diff and remove some dead code.

3fc774b

Narrow down some more diff

e97ff2b

Got the mock data to come through to the client side as expected

2a6b4f9

Address a FIXME and remove some debug code

1f6b6fc

Recognize CDC hints and serialize into a scan attribute

002ddb6

Remove debug code

1a9d19e

Experimental CDC scanner with raw scan

d810886

Snapashot of experimental changes

a3a09b3

Snapshot of experimental changes with timeline

733420f

Merge remote-tracking branch 'upstream/PHOENIX-7001-feature' into PHOENIX-7015

8fae9d6

Switch from TreeMap to HashMap where possible

26f7356

haridsv mentioned this pull request Dec 14, 2023

PHOENIX-7014 Query compiler/optimizer changes for SELECT queries against CDC #1738

Closed

haridsv added 3 commits December 15, 2023 13:55

Reuse scanDataRows from base class

a9244c5

Address spotbugs

d9dce19

Fix an NPE and add more test cases

72faab4

tkhurana reviewed Dec 19, 2023

virajjasani self-requested a review December 19, 2023 22:40

tkhurana reviewed Dec 20, 2023

Clone byte arrays for additional scenarios

ea2d8fa

virajjasani merged commit 7420443 into apache:PHOENIX-7001-feature Dec 23, 2023
0 of 2 checks passed

haridsv mentioned this pull request Jan 22, 2024

PHOENIX-7008: Addressing Jira spec and review feedback changes #1802

Merged

haridsv mentioned this pull request Mar 29, 2024

PHOENIX-7001: Initial implementation of Change Data Capture (CDC) feature #1866

Open
