PHOENIX-7001: Initial implementation of Change Data Capture (CDC) feature #1866

haridsv · 2024-03-29T07:06:56Z

This commit includes a squash of all the below changes from all the CDC feature PRs to address PHOENIX-7001 and further narrows the diff down by removing non-functional changes for the ease of review. On top of it, the changes have been reworked for the changed state of master, especially the new submodules.

4c9827a Hari.. PHOENIX-7008 Shallow grammar support for CREATE CDC (#1662)
e5220e0 TheN.. PHOENIX-7054 Shallow grammar support for DROP CDC and ALTER CDC (#1694)
581e613 Hari.. PHOENIX-7008 Implementation for CREATE CDC (#1681)
e2ef886 Hari.. PHOENIX-7008 Tweaks, fixes and additional test coverage for CREATE CDC (#1703)
5d3fd40 TheN.. PHOENIX-7074 DROP CDC Implementation (#1713)
7420443 Hari.. PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes (#1766)
da6ddad Kadi.. Add an extra delete mutation for CDC
93d586e Kadi.. Add an extra delete mutation during rebuild for CDC index
f07898f Hari.. PHOENIX-7008: Addressing Jira spec and review feedback changes (#1802)
87a2ea1 Hari.. PHOENIX-7008: Fix for parser gap and fix for failing test (#1812)
e395780 TheN.. PHOENIX-7015 Implementing CDCGlobalIndexRegionScanner (#1813)

…ture

This commit includes a squash of all the below changes from the PRs apache#1662, apache#1694, down by removing non-functional changes for the ease of review. On top of it, the changes have been reworked for the changed state of master, especially the new submodules.

* 4c9827a Hari.. PHOENIX-7008 Shallow grammar support for CREATE CDC (apache#1662)
* e5220e0 TheN.. PHOENIX-7054 Shallow grammar support for DROP CDC and ALTER CDC (apache#1694)
* 581e613 Hari.. PHOENIX-7008 Implementation for CREATE CDC (apache#1681)
* e2ef886 Hari.. PHOENIX-7008 Tweaks, fixes and additional test coverage for CREATE CDC (apache#1703)
* 5d3fd40 TheN.. PHOENIX-7074 DROP CDC Implementation (apache#1713)
* 7420443 Hari.. PHOENIX-7014: Query compiler/optimizer changes along with some PHOENIX-7015 changes (apache#1766)
* da6ddad Kadi.. Add an extra delete mutation for CDC
* 93d586e Kadi.. Add an extra delete mutation during rebuild for CDC index
* f07898f Hari.. PHOENIX-7008: Addressing Jira spec and review feedback changes (apache#1802)
* 87a2ea1 Hari.. PHOENIX-7008: Fix for parser gap and fix for failing test (apache#1812)
* e395780 TheN.. PHOENIX-7015 Implementing CDCGlobalIndexRegionScanner (apache#1813)

Co-authored-by: Saurabh Rai <saurabh.rai@salesforce.com>

kadirozde · 2024-04-11T21:29:26Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCBaseIT.java

+                "\""+CDCUtil.getCDCIndexName(cdcName)+"\"");
+        String indexFullName = SchemaUtil.getTableName(schemaName,
+                CDCUtil.getCDCIndexName(cdcName));
+        TestUtil.waitForIndexState(conn, indexFullName, PIndexState.ACTIVE);


Is this necessary? IndexTool runs and waits for the job to complete.

This was copied from existing test code (e.g., see 1 2). But now I see plenty of other places where it isn't used, so I will try without it and remove it if the tests still work fine.

tkhurana · 2024-04-11T22:20:49Z

phoenix-core-client/src/main/java/org/apache/phoenix/util/CDCUtil.java

+        return CDC_INDEX_PREFIX + SchemaUtil.getTableNameFromFullName(cdcName.toUpperCase());
+    }
+
+    public static boolean isCDCIndex(String indexName) {


Seems like this API assumes that the indexName doesn't have the schema qualifier?

Yes, that is right. There are two callers right now. One is the overloaded method that works on PTable, and it passes the name without the schema qualifier. The other is "DROP INDEX", where you wouldn't expect a schema qualifier anyway.
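
To make the assumption discussed above concrete, here is a minimal sketch of a prefix check that expects an unqualified name, plus a hypothetical wrapper that strips a schema qualifier first. The prefix value and the `isCDCIndexFullName` helper are assumptions for illustration only; they are not taken from CDCUtil.

```java
/** Hypothetical sketch; not the actual CDCUtil code. */
final class CdcIndexNameSketch {
    // Assumed value, for illustration only; the real CDC_INDEX_PREFIX is not shown in this thread.
    static final String CDC_INDEX_PREFIX = "__CDC__";

    /** Expects an index name WITHOUT a schema qualifier. */
    static boolean isCDCIndex(String unqualifiedIndexName) {
        return unqualifiedIndexName.startsWith(CDC_INDEX_PREFIX);
    }

    /** Hypothetical helper: drops a leading "SCHEMA." before running the check. */
    static boolean isCDCIndexFullName(String fullName) {
        int dot = fullName.lastIndexOf('.');
        String name = dot < 0 ? fullName : fullName.substring(dot + 1);
        return isCDCIndex(name);
    }

    public static void main(String[] args) {
        System.out.println(isCDCIndex("__CDC__MY_CDC"));           // true
        System.out.println(isCDCIndex("S.__CDC__MY_CDC"));         // false: qualifier defeats the prefix check
        System.out.println(isCDCIndexFullName("S.__CDC__MY_CDC")); // true
    }
}
```

The sketch ignores quoted names that contain dots; it only shows why a qualified name would fail a plain prefix check.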

tkhurana · 2024-04-11T22:24:30Z

phoenix-core-client/src/main/java/org/apache/phoenix/util/CDCUtil.java

+        return isCDCIndex(indexTable.getTableName().getString());
+    }
+
+    public static Scan initForRawScan(Scan scan) {


A better name for this API could be initializeScanForCDC or setupScanForCDC because it is doing more than just setting the raw attribute.

Good point! It started purely for setting raw scan attributes so the name was appropriate back then, but your suggestion to rename sounds good.

tkhurana · 2024-04-11T22:33:55Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCDefinitionIT.java

+        assertNoResults(conn, cdcName);
+
+        try {
+            conn.createStatement().execute(cdc_sql);


Can I create multiple CDCs on the same table with different CDC names but the same change scope?

Yes, the CDC objects are completely independent of each other, each with its own change scopes, so they can use the same or different change scopes.

tkhurana · 2024-04-11T22:35:05Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCDefinitionIT.java

+                " INCLUDE (pre, post) INDEX_TYPE=g");
+
+        cdcName = generateUniqueName();
+        cdc_sql = "CREATE CDC " + cdcName + " ON " + tableName + " INCLUDE (pre, post)";


Instead of using the hard-coded strings PRE and POST, use the constants defined in the code.

Sure. Just to confirm, you are referring to line 107, correct? I would like to leave the usage in 102 and 105 as is.

tkhurana · 2024-04-11T22:51:59Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCDefinitionIT.java

+                        saltingConfig[1], null);
+                try {
+                    assertCDCState(conn, cdcName, null, 3);
+                    // Index inherits table salt buckets.


This comment is not right.

It is meant for the next line, will move it.


kadirozde · 2024-04-17T05:50:52Z

I find the use of maps to keep track of mutations and pre and post images too complex in the ITs. The test code should be easy to read and verify. We need a list of mutations, and then we check the CDC result against this list. For example, each mutation can be modeled as a timestamp, a mutation type, a list of column names and a list of values represented as strings. This list can be initialized when it is constructed. Creating upsert and delete statements from this mutation list should be straightforward. Comparing the CDC result against this list should also be simple. The pre and post images can be retrieved from the data table using SCN connections and verified against the CDC result.
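
A rough sketch of the mutation model suggested above, assuming a flat class with a timestamp, a type, and parallel lists of column names and string values. The class name `MutationSketch` and the `toSql` method are hypothetical, and treating the listed columns of a DELETE as the key predicate is an assumption made for illustration; this is not the generator that was later added to the ITs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical test-side model of one mutation; not code from this PR. */
final class MutationSketch {
    enum Type { UPSERT, DELETE }

    final long timestamp;        // when the mutation is applied; could later drive SCN reads
    final Type type;
    final List<String> columns;  // column names, in order
    final List<String> values;   // SQL literals as strings, parallel to columns

    MutationSketch(long timestamp, Type type, List<String> columns, List<String> values) {
        if (columns.size() != values.size()) {
            throw new IllegalArgumentException("columns and values must have the same size");
        }
        this.timestamp = timestamp;
        this.type = type;
        this.columns = columns;
        this.values = values;
    }

    /** Builds the statement text; for DELETE the columns are assumed to be key columns. */
    String toSql(String tableName) {
        if (type == Type.UPSERT) {
            return "UPSERT INTO " + tableName + " (" + String.join(", ", columns)
                    + ") VALUES (" + String.join(", ", values) + ")";
        }
        List<String> predicates = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            predicates.add(columns.get(i) + " = " + values.get(i));
        }
        return "DELETE FROM " + tableName + " WHERE " + String.join(" AND ", predicates);
    }

    public static void main(String[] args) {
        List<MutationSketch> mutations = Arrays.asList(
                new MutationSketch(1000L, Type.UPSERT, Arrays.asList("K", "V1"), Arrays.asList("1", "'a'")),
                new MutationSketch(2000L, Type.DELETE, Arrays.asList("K"), Arrays.asList("1")));
        for (MutationSketch m : mutations) {
            System.out.println(m.timestamp + ": " + m.toSql("T"));
        }
    }
}
```

With a list like this, the same entries can drive both the statements that are executed and the expected rows that the CDC query result is compared against.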

virajjasani · 2024-04-17T06:59:38Z

...ix-core-server/src/main/java/org/apache/phoenix/coprocessor/CDCGlobalIndexRegionScanner.java

+        Gson gson = new GsonBuilder().serializeNulls().create();
+        byte[] value = gson.toJson(changeBuilder.buildCDCEvent()).getBytes(StandardCharsets.UTF_8);
+        CellBuilder builder = CellBuilderFactory.create(CellBuilderType.SHALLOW_COPY);
+        Result cdcRow = Result.create(Arrays.asList(builder
+                .setRow(indexRowKey)
+                .setFamily(ImmutableBytesPtr.cloneCellFamilyIfNecessary(firstCell))
+                .setQualifier(cdcDataTableInfo.getCdcJsonColQualBytes())
+                .setTimestamp(changeBuilder.getChangeTimestamp())
+                .setValue(value)
+                .setType(Cell.Type.Put)
+                .build()));
+        return cdcRow;


Can we use Jackson ObjectMapper to build the JSON here? Also, Jackson is more efficient and thread-safe, so we can define a public static ObjectMapper instance and use that static object here.

This has two advantages:

No need to create a new Gson object every time we create JSON.

No new dependency needs to be introduced for Gson.

We also use JacksonUtil at some places: https://github.com/apache/phoenix/blob/master/phoenix-core-client/src/main/java/org/apache/phoenix/util/JacksonUtil.java

I switched from using Gson to Jackson.

virajjasani · 2024-04-17T07:04:26Z

pom.xml

+      <dependency>
+        <groupId>com.google.code.gson</groupId>
+        <artifactId>gson</artifactId>
+        <version>${gson.version}</version>
+      </dependency>


This can be avoided altogether if we use JacksonUtil

virajjasani · 2024-04-17T07:07:12Z

phoenix-core-client/src/main/java/org/apache/phoenix/util/CDCChangeBuilder.java

+        }
+    }
+
+    public Map buildCDCEvent() {


Wherever a raw Map is used, let's replace it with Map<String, Object>.
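
As a small illustration of this suggestion, the sketch below returns a parameterized map instead of a raw one. The class name, method signature and key names are hypothetical and are not the actual CDCChangeBuilder fields or the CDC JSON schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; key names are illustrative, not the real CDC event schema. */
final class CdcEventSketch {
    /**
     * With Map<String, Object> the compiler checks that keys are Strings and
     * callers get no unchecked warnings, unlike with a raw Map return type.
     */
    static Map<String, Object> buildCDCEvent(String eventType,
                                             Map<String, Object> preImage,
                                             Map<String, Object> postImage) {
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("event_type", eventType);
        event.put("pre_image", preImage);   // may be null, e.g. for the first upsert of a row
        event.put("post_image", postImage);
        return event;
    }

    public static void main(String[] args) {
        Map<String, Object> post = new LinkedHashMap<>();
        post.put("V1", "a");
        Map<String, Object> event = buildCDCEvent("upsert", null, post);
        System.out.println(event.keySet());
    }
}
```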

virajjasani · 2024-04-27T16:33:52Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCQueryIT.java

+            createTable(conn, "CREATE TABLE  " + tableName + " (" +
+                    (multitenant ? "TENANT_ID CHAR(5) NOT NULL, " : "") +
+                    "k INTEGER NOT NULL, a_binary binary(10), d Date, t TIMESTAMP, " +
+                    "CONSTRAINT PK PRIMARY KEY " +
+                    (multitenant ? "(TENANT_ID, k) " : "(k)") + ")", encodingScheme, multitenant,
+                    tableSaltBuckets, false, null);


Let's also add a VARBINARY test?

Added a new test that has VARBINARY and a few other types.


virajjasani · 2024-05-16T03:54:30Z

Now that JSON PR is merged, we can also address it with changes/pre/post image and include some tests for JSON data.


kadirozde

+1, Thanks!

kadirozde · 2024-05-20T07:29:27Z

Now that JSON PR is merged, we can also address it with changes/pre/post image and include some tests for JSON data.

Let us open a separate Jira to support complex data types including Array and JSON.

virajjasani · 2024-05-28T16:49:41Z

Belated +1

palashc · 2024-05-30T18:10:31Z

phoenix-core/src/it/java/org/apache/phoenix/end2end/index/BaseIndexIT.java

@@ -1514,7 +1514,6 @@ public void testLastDDLTimestampOnAsyncIndexes() throws Exception {

            // run the index MR job.
            IndexToolIT.runIndexTool(false, TestUtil.DEFAULT_SCHEMA_NAME, tableName, indexName);
-            TestUtil.waitForIndexState(conn, fullIndexName, PIndexState.ACTIVE);


@haridsv Would you remember why this change was made in this PR? This breaks a few tests on my PR #1883.
Rushabh made the change originally and I think it is needed. Please let me know if there was a reason to remove it in this PR.

Actually, this change is causing these 3 tests to fail continuously on all PRs on master:
GlobalImmutableTxIndexWithRegionMovesIT.testLastDDLTimestampOnAsyncIndexes
GlobalMutableTxIndexIT.testLastDDLTimestampOnAsyncIndexes
GlobalImmutableTxIndexIT.testLastDDLTimestampOnAsyncIndexes

I see it was removed based on Kadir's comment here - #1866 (comment)

@tkhurana Should I add it back to BaseIndexIT and BaseIndexWithRegionMovesIT in my PR, although it will be a while before it gets merged to master? Not sure why the behaviour of IndexTool is different in these 3 cases.

We will have to put this back, we are just discussing this internally.
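
For readers following the waitForIndexState discussion: a wait of this kind is usually a poll-until-state-or-timeout loop. The sketch below shows that general pattern only; the class, method and parameters are hypothetical, and it is not the TestUtil.waitForIndexState implementation.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of a poll-until-state helper; not Phoenix test code. */
final class WaitForStateSketch {
    /**
     * Polls currentState until it equals expected or timeoutMs elapses.
     * Returns true if the expected state was observed, false on timeout or interrupt.
     */
    static <T> boolean waitForState(Supplier<T> currentState, T expected,
                                    long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (expected.equals(currentState.get())) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated state source: reports BUILDING twice, then ACTIVE.
        boolean active = waitForState(() -> ++calls[0] < 3 ? "BUILDING" : "ACTIVE",
                "ACTIVE", 2000L, 1L);
        System.out.println(active);
    }
}
```

A test that asserts on index state right after the job returns has no such loop to absorb a delay, which is one possible reason the removed call mattered in the failing tests; the thread does not confirm the actual cause.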

haridsv and others added 3 commits March 29, 2024 12:35

Add gson as an explicit depenndency to phoenix-core-server

fdd9ea9

Fix merge conflict resolution issues, should fix BackwardCompaitiblityIT

ce9489b

virajjasani self-requested a review April 5, 2024 03:45

Merge branch 'master' into PHOENIX-7001-feature-squash

8bb623b

kadirozde reviewed Apr 11, 2024


tkhurana reviewed Apr 11, 2024


haridsv added 3 commits April 12, 2024 10:28

Address some review comments

85446da

Remove unnecessary calls to waitForIndexState in existing code

d285a95

Remove code supporting INDEX_TYPE attribute as we don't support local index anyway

743b980

virajjasani reviewed Apr 17, 2024


virajjasani reviewed Apr 27, 2024


haridsv added 6 commits May 9, 2024 18:57

Address review feedback on using SCN queries to verify pre and post images

507800d

Reworked ChangeRow to purely represent the change alone as a follow up to the previous commit

bebb18c

Generic mutation generator and a new test based on that

3e20909

Added more types

a8280d3

More debug logging

c9d2bce

Fixes for a couple of test issues and debug code commented

ec90584

haridsv added 2 commits May 17, 2024 10:06

Special handling in IT for trailing spaces in CHAR type

992b157

Use Jackson instead of Gson to avoid pulling in a new dependency

c31d0ef

haridsv requested review from virajjasani and tkhurana May 20, 2024 06:43

haridsv requested a review from kadirozde May 20, 2024 06:43

Merge remote-tracking branch 'upstream/master' into PHOENIX-7001-feature-squash

5d479b9

kadirozde approved these changes May 20, 2024


haridsv and others added 2 commits May 20, 2024 20:39

Fix build issues

74e8e63

Empty-Commit

426e5f5

kadirozde merged commit bd367cc into apache:master May 28, 2024
0 of 7 checks passed

palashc reviewed May 30, 2024


haridsv mentioned this pull request May 31, 2024

PHOENIX-7001: Addendum to initial CDC feature #1899

Merged
