Question: Intermediate table cleanup process after merge flush #399

Open
Blackrobe opened this issue Mar 22, 2024 · 0 comments

Blackrobe commented Mar 22, 2024

Hi,

I was reading how the sink connector performs a merge flush and saw that, after a batch of data in the intermediate table has been merged, a DELETE FROM statement is executed, as shown here:
https://github.com/confluentinc/kafka-connect-bigquery/blob/f176198503e37e743d04b7bf687d7d08571ea851/kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/MergeQueries.java

But I am confused about the purpose of line 492:

.append("AND _PARTITIONTIME IS NOT NULL")

When the "DELETE FROM" ignores rows in the null partition, doesn't this mean that there will be a batch number that is never merged?

I was thinking about what would happen in the scenario below:
...
at 3:00 AM: Merge flush for batchNumber 3190
at 3:01 AM: Delete from intermediate table where batchNumber <= 3190 and _partitiontime is not null
at 4:00 AM: Merge flush for batchNumber 3191
...

From this alone, the second step won't delete records that satisfy both "batchNumber = 3190" and "_partitiontime is null".

Doesn't this basically mean that the records left over in batchNumber 3190 won't ever be merged?
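
To make the scenario concrete, here is a minimal Python sketch of just the DELETE predicate quoted above (batchNumber <= 3190 AND _PARTITIONTIME IS NOT NULL). It is only a model, not the connector's code: the `Row` type, the `delete_merged` helper, and the sample rows are hypothetical names and data made up for illustration, and it says nothing about which rows the preceding merge actually reads.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    """Hypothetical stand-in for one intermediate-table row."""
    key: str
    batch_number: int
    partition_time: Optional[str]  # None models "_PARTITIONTIME IS NULL"


def delete_merged(rows: List[Row], batch_number: int) -> List[Row]:
    """Model of: DELETE FROM intermediate
                 WHERE batchNumber <= batch_number
                   AND _PARTITIONTIME IS NOT NULL
    Returns the rows that remain in the table afterwards."""
    return [
        r for r in rows
        if not (r.batch_number <= batch_number and r.partition_time is not None)
    ]


# Illustrative data only: two rows in batch 3190, one in batch 3191.
rows = [
    Row("a", 3190, "2024-03-22"),  # has a partition time
    Row("b", 3190, None),          # null partition time
    Row("c", 3191, None),          # belongs to the next batch
]

remaining = delete_merged(rows, 3190)
print([r.key for r in remaining])  # ['b', 'c']
```

Row "b" is the case I am asking about: it belongs to batch 3190 but is still in the intermediate table after the 3:01 AM delete.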
