Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes #1766

Merged
merged 29 commits into from Dec 23, 2023

Conversation

haridsv
Copy link
Contributor

@haridsv haridsv commented Dec 14, 2023

This PR replaces #1738 to include some experimental scanner changes for PHOENIX-7015.

This has the changes needed for PHOENIX-7013 and PHOENIX-7014. It short-circuits the existing optimizer logic to produce a plan that is suitable for CDC queries. It also reimplements the query hints so that it is simpler and easier to maintain using regex. It includes a PoC query test and some debug test cases for short term reference (can be removed before merging to master).

This PR also goes one step further to generating a mock JSON string (PoC in #1738) to be able to generate results from the actual data (though, not per the expected format). I temporary added Gson as a dependency to generate the result, but this will be replaced with BSON once the JSON support is available. Here is a quick demo of how this works:

0: jdbc:phoenix:localhost> create table t (k INTEGER PRIMARY KEY, v1 INTEGER);
No rows affected (0.651 seconds)
0: jdbc:phoenix:localhost> upsert into t (k, v1) VALUES (1, 100);
1 row affected (0.067 seconds)
0: jdbc:phoenix:localhost> upsert into t (k, v1) VALUES (2, 200);
1 row affected (0.067 seconds)
0: jdbc:phoenix:localhost> upsert into t (k, v1) VALUES (1, 101);
1 row affected (0.067 seconds)
0: jdbc:phoenix:localhost> create cdc tchanges  on t(PHOENIX_ROW_TIMESTAMP());
2 rows affected (5.686 seconds)
0: jdbc:phoenix:localhost> select * from tchanges;
+--------------------------+---+------------+
|  PHOENIX_ROW_TIMESTAMP() | K |  CDC JSON  |
+--------------------------+---+------------+
| 2023-12-06               | 1 | {"V1":100} |
| 2023-12-06               | 2 | {"V1":200} |
| 2023-12-06               | 1 | {"V1":101} |
+--------------------------+---+------------+
3 rows selected (0.022 seconds

… type of ProjectedColumnExpression instead of KeyValueColumnExpression
…ompletely. Make index as the physical table for CDC
private Map<ImmutableBytesPtr, String> dataColQualNameMap;
private Map<ImmutableBytesPtr, PDataType> dataColQualTypeMap;
// Map<dataRowKey: Map<TS: Map<qualifier: Cell>>>
private Map<ImmutableBytesPtr, Map<Long, Map<ImmutableBytesPtr, Cell>>> dataRowChanges =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty complicated. Maybe you could introduce some abstraction which represents a change as a object. Then you could have a collection of changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was meant to be a quick PoC sort of code and should go away as @TheNamesRai is already reworking the logic to build the row without having to use this intermediate data structure.

@virajjasani virajjasani self-requested a review December 19, 2023 22:40
conn.createStatement().execute("UPSERT INTO " + tableName + " (k, v1) VALUES (1, 101)");
conn.commit();
String cdcName = generateUniqueName();
String cdc_sql = "CREATE CDC " + cdcName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everytime we run CREATE CDC do we need to pass PHOENIX_ROW_TIMESTAMP ? If that is something mandatory we shouldn't make it part of the interface and the system should do it implicitly under the hood.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it seems the uncovered index that is being created as part of CREATE CDC is being created synchronously. That will only work for small tables. For bigger tables the index needs to be created async and then built explicitly using IndexTool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everytime we run CREATE CDC do we need to pass PHOENIX_ROW_TIMESTAMP ? If that is something mandatory we shouldn't make it part of the interface and the system should do it implicitly under the hood.

The feature allows any timestamp like column to be used instead of the PHOENIX_ROW_TIMESTMAP that is why this flexibility is being allowed, however I am not 100% sure about the use cases, I will have an offline discussion on this, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For bigger tables the index needs to be created async and then built explicitly using IndexTool.

If we do this, won't we also need to project the index status via CDC object, since we intend to keep the index hidden as an implementation detail? Do you think we should have an offline mode for CREATE CDC itself?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, won't we also need to project the index status via CDC object

I think that can be later improvement but supporting creation of async index is necessary because creating sync index for large table will anyways be too expensive or not possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, won't we also need to project the index status via CDC object

@haridsv I didn't quite understand what you mean by index status here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite understand what you mean by index status here ?

I was referring to the async index creation lifecycle, which should be attributed to the CDC so that user can query and know when it is done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkhurana In any case, I am not planning to address these as part of this PR, so are you OK to merge this as is?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, we can tackle that in a separate PR.

Copy link
Contributor Author

@haridsv haridsv Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkhurana Async index creation is addressed as part of #1802.

Also removed the need to explicitly specify PHOENIX_ROW_TIMESTAMP() as we dropped the support for user specified timestamp column.

@virajjasani
Copy link
Contributor

@tkhurana if you are good with the current state with any additional comments to be taken care of in subsequent PRs against feature branch (or final patch against master branch), we can merge this.

@virajjasani
Copy link
Contributor

With follow-ups to come in separate PRs, merging this one.

@virajjasani virajjasani merged commit 7420443 into apache:PHOENIX-7001-feature Dec 23, 2023
0 of 2 checks passed
@haridsv
Copy link
Contributor Author

haridsv commented Jan 18, 2024

PR #1794 is the follow up that implements the query functionality to the spec.

haridsv added a commit to haridsv/phoenix that referenced this pull request Jan 22, 2024
Address the changes in the CREATE CDC spec in PHOENIX-7001
Also includes some review feedback changes across the changes in PRs apache#1681, apache#1703 and apache#1766
haridsv added a commit to haridsv/phoenix that referenced this pull request Jan 23, 2024
Address the changes in the CREATE CDC spec in PHOENIX-7001
Also includes some review feedback changes across the changes in PRs apache#1681, apache#1703 and apache#1766
haridsv added a commit to haridsv/phoenix that referenced this pull request Jan 24, 2024
Address the changes in the CREATE CDC spec in PHOENIX-7001
Also includes some review feedback changes across the changes in PRs apache#1681, apache#1703 and apache#1766
haridsv added a commit to haridsv/phoenix that referenced this pull request Mar 29, 2024
This commit includes a squash of all the below changes from the PRs apache#1662, apache#1694,
down by removing non-functional changes for the ease of review. On top of it,
the changes have been reworked for the changed state of master, especially the
new submodules.

* 4c9827a Hari.. PHOENIX-7008 Shallow grammar support for CREATE CDC (apache#1662)
* e5220e0 TheN..  PHOENIX-7054 Shallow grammar support for DROP CDC and ALTER CDC  (apache#1694)
* 581e613 Hari.. PHOENIX-7008 Implementation for CREATE CDC  (apache#1681)
* e2ef886 Hari.. PHOENIX-7008 Tweaks, fixes and additional test coverage for CREATE CDC (apache#1703)
* 5d3fd40 TheN.. PHOENIX-7074 DROP CDC Implementation (apache#1713)
* 7420443 Hari.. PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes (apache#1766)
* da6ddad Kadi.. Add an extra delete mutation for CDC
* 93d586e Kadi.. Add an extra delete mutation during rebuild for CDC index
* f07898f Hari.. PHOENIX-7008: Addressing Jira spec and review feedback changes (apache#1802)
* 87a2ea1 Hari.. PHOENIX-7008: Fix for parser gap and fix for failing test (apache#1812)
* e395780 TheN.. PHOENIX-7015 Implementing CDCGlobalIndexRegionScanner (apache#1813)

Co-authored-by: Saurabh Rai <saurabh.rai@salesforce.com>
haridsv added a commit to haridsv/phoenix that referenced this pull request Mar 29, 2024
…ture

This commit includes a squash of all the below changes from the PRs apache#1662, apache#1694,
down by removing non-functional changes for the ease of review. On top of it,
the changes have been reworked for the changed state of master, especially the
new submodules.

* 4c9827a Hari.. PHOENIX-7008 Shallow grammar support for CREATE CDC (apache#1662)
* e5220e0 TheN..  PHOENIX-7054 Shallow grammar support for DROP CDC and ALTER CDC  (apache#1694)
* 581e613 Hari.. PHOENIX-7008 Implementation for CREATE CDC  (apache#1681)
* e2ef886 Hari.. PHOENIX-7008 Tweaks, fixes and additional test coverage for CREATE CDC (apache#1703)
* 5d3fd40 TheN.. PHOENIX-7074 DROP CDC Implementation (apache#1713)
* 7420443 Hari.. PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes (apache#1766)
* da6ddad Kadi.. Add an extra delete mutation for CDC
* 93d586e Kadi.. Add an extra delete mutation during rebuild for CDC index
* f07898f Hari.. PHOENIX-7008: Addressing Jira spec and review feedback changes (apache#1802)
* 87a2ea1 Hari.. PHOENIX-7008: Fix for parser gap and fix for failing test (apache#1812)
* e395780 TheN.. PHOENIX-7015 Implementing CDCGlobalIndexRegionScanner (apache#1813)

Co-authored-by: Saurabh Rai <saurabh.rai@salesforce.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants