Missing Updates with Timestamp+incrementing mode in Postgres #172

Closed
salemn opened this issue Dec 1, 2016 · 23 comments


@salemn salemn commented Dec 1, 2016

Hello Everybody,

I'm deploying kafka-connect, in timestamp+incrementing mode, on our platform in order to transfer data from our main database to a secondary one. I'm using source and sink connectors. Both databases are Postgres.

I'm facing a missing-rows issue. The case is not frequent, but it happens from time to time.

Here is the situation:

postgres=# CREATE TABLE foo(id serial primary key, updated timestamp without time zone);
CREATE TABLE
postgres=# CREATE TRIGGER update_updated_column
postgres-#   BEFORE INSERT OR UPDATE
postgres-#   ON foo
postgres-#   FOR EACH ROW
postgres-#   EXECUTE PROCEDURE update_updated_column();
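
(The trigger function itself isn't shown in the report; presumably it just stamps each row with the current time. A hypothetical sketch, assuming that behaviour:)

CREATE OR REPLACE FUNCTION update_updated_column() RETURNS trigger AS $$
BEGIN
  -- now() returns the transaction's start time in Postgres, which is why,
  -- in the sessions below, row 1 carries an earlier timestamp than row 3
  -- even though it only becomes visible once session 1 commits.
  NEW.updated := now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
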
--
SESSION 1:
postgres=# begin;
BEGIN
postgres=# INSERT INTO foo(updated) VALUES (null);
INSERT 0 1
postgres=# SELECT * FROM foo;
 id |          updated           
----+----------------------------
  1 | 2016-12-01 08:53:54.099995
(1 row)
---

SESSION 2 :
postgres=# BEGIN;
BEGIN
postgres=# INSERT INTO foo(updated) VALUES (null);
INSERT 0 1
postgres=# SELECT * FROM foo;
 id |          updated           
----+----------------------------
  3 | 2016-12-01 08:54:46.420757
(1 row)

postgres=# COMMIT;
COMMIT
postgres=# SELECT * FROM foo;
 id |          updated           
----+----------------------------
  3 | 2016-12-01 08:54:46.420757
(1 row)
----

SESSION 1 :
postgres=# COMMIT;
COMMIT
postgres=# SELECT * FROM foo;
 id |          updated           
----+----------------------------
  1 | 2016-12-01 08:53:54.099995
  3 | 2016-12-01 08:54:46.420757
(2 rows)

The row with id 1 is missing in my target database.

Is there any way to use the source connector to avoid such a situation?

Best Regards
Naoufel


@shikhar shikhar commented Dec 8, 2016

@salemn is the missing row present in the Kafka topic written to by the source connector? (to isolate whether the issue is on the source or the sink side of things)


@salemn salemn commented Dec 9, 2016

@shikhar I think the issue is in the source connector, as the row is missing in the topic.


@gfodor gfodor commented Dec 11, 2016

Is there any chance that some of the records you are seeing dropped have approximately the same timestamp as other records? The PR here:

#49

Basically resolves an issue where the queries sent to PostgreSQL were filtering at millisecond (vs. microsecond) resolution, so my suspicion is that if you have records whose timestamps fall within the same millisecond, some could get missed, since they would be mischaracterized as falling earlier than the last committed offset timestamp if consumed across commits. (Right now this is just a theory; I haven't confirmed it.)


@gilbsgilbs gilbsgilbs commented Dec 11, 2016

@gfodor I think the issue is a basic race condition (sketched in query terms after the steps below):

  1. Transaction TR0 begins, inserts row R0 (at updated = T0), and hangs without committing.
  2. Transaction TR1 begins, inserts row R1 (at updated = T1 > T0), and commits.
  3. Source connector polls. It sees R1, but not R0 (because R0 is not committed yet). The stored timestamp offset is now T1, so the source connector will assume that nothing could have happened before T1.
  4. TR0 commits eventually. Recall that R0 has an older updated timestamp than R1.
  5. Source connector polls, but misses R0 because T0 < T1.
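
In query terms, a simplified sketch of the kind of predicate the connector issues in timestamp+incrementing mode (not its literal SQL) shows why R0 is skipped:

-- :ts_offset / :inc_offset are the offsets stored after the previous poll
SELECT * FROM foo
 WHERE updated > :ts_offset
    OR (updated = :ts_offset AND id > :inc_offset)
 ORDER BY updated, id ASC;
-- After step 3 the stored :ts_offset is T1, so once TR0 finally commits,
-- R0 (updated = T0 < T1) can never satisfy the predicate again.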


@ewencp ewencp commented Jan 4, 2017

@salemn Did you set timestamp.delay.interval.ms? Without a delay, there is a possibility that transactions are still open and their rows not yet visible to queriers, even though rows with "newer" timestamps are already visible. You'd want to set this value to be conservatively longer than the longest time you expect a transaction to take. This is just a fundamental limitation of working through the JDBC interface.
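
For reference, a minimal source-connector configuration along these lines might look like the following sketch (the connection details and the 60-second delay are illustrative placeholders; the delay should comfortably exceed your longest transaction):

name=postgres-foo-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://primary-db:5432/postgres
mode=timestamp+incrementing
timestamp.column.name=updated
incrementing.column.name=id
table.whitelist=foo
topic.prefix=postgres-
# conservatively longer than the longest expected transaction
timestamp.delay.interval.ms=60000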


@willyhoang willyhoang commented Jan 20, 2017

If I'm using a source connector with incrementing mode (with only an incrementing id and no timestamp), is there an equivalent config that we could use to resolve this race condition?


@SemanticBeeng SemanticBeeng commented Mar 23, 2017

Can this #172 (comment) happen when using "mode": "timestamp+incrementing" and "timestamp.delay.interval.ms": 3000 (the assumption being that transactions finish in under 1 sec)? @ewencp


@gilbsgilbs gilbsgilbs commented Apr 21, 2017

@SemanticBeeng Assuming that the delay is always higher than the slowest transaction, that is perfectly fine. However, a tiny delay such as 3 seconds will be difficult to keep in the long run. One day or another you'll have some misses and will need to implement a backup mechanism or increase the delay. I don't know what your use case is, but I would say that most of the time, if it's really important for you not to miss anything, you'll need to increase the delay drastically, trading off lag. That's the main reason why we gave up on using kafka-connect-jdbc.


@song860918 song860918 commented May 16, 2017

@willyhoang Did you get the incrementing mode problem figured out? I'm running into the same issue.


@makearl makearl commented May 30, 2017

For incrementing mode, it looks like this becomes a problem once the connector finishes backfilling and catches up to real time. Should there be a configuration option similar to timestamp.delay.interval.ms for incrementing mode?


@msingh00 msingh00 commented Feb 16, 2018

@gilbsgilbs what are you using instead of kafka-connect-jdbc, and why wouldn't any other ODBC driver exhibit the same problem?


@vibtrip vibtrip commented Mar 14, 2018

Hi, did the original author of this issue get it resolved? I am facing the same issue with SQL Server. Basically, it seems like even in "timestamp+incrementing" mode the connector is effectively doing strictly "incrementing": no updates involving earlier incrementing IDs show up, but new rows inserted with higher IDs do show up. @salemn


@vibtrip vibtrip commented Mar 14, 2018

@ewencp was the timestamp+incrementing issue here resolved with the delay? Is a 5-second delay good enough when polling every second?


@gilbsgilbs gilbsgilbs commented Mar 14, 2018

@dizzy0ny We ended up using nothing and abandoned the project. Any ODBC driver will suffer from the same issues, because this approach tries to solve the problem with the wrong tools. There's a reason DBMS replication systems are based on transaction ids: it's the only way to guarantee both consistency and real-time delivery. Not using XIDs (one way or the other) implies a trade-off between consistency (missed rows due to "transaction shadowing", for instance) and real-time delivery (a huge lag is required to ensure all transactions are over). This might be OK in some (maybe most) cases, but I really reckon the docs should be more upfront about it. Anyhow, some tools that let you leverage logical replication to solve the same problem have emerged recently. Take a look at Debezium or Bottledwater (which seems abandoned), or even AWS DMS. They might be a better fit.

@vibtrip I don't think your problem is related to the OP's. The issue is about some marginal missed updates. Please open another issue.


@msingh00 msingh00 commented Mar 15, 2018

@gilbsgilbs The only way I can think of to solve this may be to violate ACID and use read uncommitted; that presents a whole lot of other issues, however.
Also, increasing timestamp.delay.interval.ms doesn't necessarily mean you are incurring lag, does it? I think it simply means the query will look back further, so it may return more data than it should, some of which may have been processed already. That's what I thought, and I could be wrong.


@gilbsgilbs gilbsgilbs commented Mar 15, 2018

@dizzy0ny

the only way i can think of to solve this may be to violate ACID and use read uncommitted. that presents a whole lot of other issues however.

Yeah, we also thought of this, but as you mention it has other terrible drawbacks. You'll indeed see the rows soon enough; however, if they're rolled back, you will replicate rows that you most likely want to avoid replicating.

However, increasing the timestamp.delay.interval.ms doesn't necessarily mean you are incurring lag does it? i think it simply mean the query will look back further..meaning it may return more data than it should which may have been processed already. That's what i thought and could be wrong.

I may also be wrong and have misinterpreted the doc. But,

The first execution will fetch all available records (i.e. starting at timestamp 0) until current time minus the delay. Every following execution will get data from the last time we fetched until current time minus the delay.

Which means to me that you'll lag behind by the delay. As I understand it, the connector will always continue where it last finished, no matter what. This is probably to avoid spamming the database and fetching too many unneeded records, or, even worse, pushing tons of duplicate events into Kafka. I think this is totally fine, expected, and correct behavior.
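
In other words, each poll effectively covers a window like the following (an illustrative sketch of the predicate, assuming a 60-second delay; not the connector's literal SQL):

SELECT * FROM foo
 WHERE updated > :last_offset                     -- where the previous poll stopped
   AND updated < now() - interval '60 seconds'    -- timestamp.delay.interval.ms
 ORDER BY updated, id;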


@vibtrip vibtrip commented Mar 15, 2018

@dizzy0ny @gilbsgilbs
My problem appears to be resolved with an increased offset.flush.interval.ms in the common worker config, plus timestamp.delay.interval.ms. We were facing marginally missed updates from subsequent database transactions. The first set of inserts always worked fine, but in subsequent batches, if there were some updates to existing rows plus new inserts, we were seeing the updates getting dropped; by the way Kafka sorted the records, those were always the first few, due to their incremental IDs.


@SemanticBeeng SemanticBeeng commented Mar 16, 2018

Indeed, Debezium has the proper paradigm (change data capture).


@tishna tishna commented Apr 6, 2018

I have looked at numerous possibilities for why we might miss some records. Can anyone please advise whether this issue can also happen with NOLOCK SQL queries? Or can I eliminate the issue mentioned in this thread by using NOLOCK?

Please advise at the earliest, thanks!


@AnilGodaba AnilGodaba commented May 8, 2018

@salemn was your issue resolved? We are facing a similar issue (records are sometimes not picked up from a view). Please share your resolution if your issue was fixed by changing timestamp.delay.interval.ms and/or offset.flush.interval.ms.
Can anyone else in this forum help?


@wicknicks wicknicks commented Nov 9, 2018

@AnilGodaba @tushtinaryani: missing records typically happen when records are inserted/updated out of order. A workaround for that is to add a new column containing a monotonically increasing timestamp or incrementing id for the JDBC connector to use for change detection (a rough sketch follows below), or to specify a query that can determine the changes in the table. Or, as others have suggested in this thread, consider using Debezium for CDC.
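
A rough sketch of the dedicated-column workaround (all names here are made up for illustration; note that this helps with out-of-order updates, but does not by itself remove the open-transaction race discussed above):

ALTER TABLE foo ADD COLUMN change_seq bigint;
CREATE SEQUENCE foo_change_seq;

CREATE OR REPLACE FUNCTION bump_change_seq() RETURNS trigger AS $$
BEGIN
  -- assign a new, strictly increasing value on every insert/update
  NEW.change_seq := nextval('foo_change_seq');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_bump_change_seq
  BEFORE INSERT OR UPDATE ON foo
  FOR EACH ROW EXECUTE PROCEDURE bump_change_seq();

-- the connector would then use change_seq as its incrementing column:
--   incrementing.column.name=change_seq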


@ekoutanov ekoutanov commented Jun 6, 2019

This question has also been asked in a Stack Overflow post. It is somewhat concerning that the Confluent documentation fails to mention this as a potential safety consideration, in spite of it being a well-known issue, open for 2.5 years.


@rmoff rmoff commented Apr 21, 2021

If you need to guarantee that you will capture every single event from your database, you need to use log-based Change Data Capture (with connectors such as Debezium).

The JDBC Source connector implements query-based Change Data Capture.

For a detailed explanation of this, see:

@rmoff rmoff closed this as completed Apr 21, 2021