Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2979] Fix a race condition in getWatermark() in KafkaIO. #3985

Closed
wants to merge 3 commits into from

Conversation

rangadi
Copy link
Contributor

@rangadi rangadi commented Oct 12, 2017

Two fixes :

  • Don't set curRecord to null before updating. If user deserializers throw, ok to keep curRecord pointing to old one.
  • Use atomic references for curRecord and curTimestamp since these are accessed in getWatermark() and getCurrentTimestamp(). These methods could be called concurrently with advance().

@coveralls
Copy link

Coverage Status

Coverage increased (+1.0%) to 69.61% when pulling 70d13ff on rangadi:race_in_kafkaio into 6dd90d8 on apache:master.

@rangadi
Copy link
Contributor Author

rangadi commented Oct 12, 2017

+R @tgroh. Thanks Thomas.

private KafkaRecord<K, V> curRecord;
private Instant curTimestamp;
// curRecord and curTimestamp are accessed outside advance(), which might be another thread.
private AtomicReference<KafkaRecord<K, V>> curRecord = new AtomicReference<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely smelly; the current record and current timestamp must be updated atomically and simultaneously, so using two separate atomic references is not sufficient - you need to store the timestamped record here

Copy link
Contributor Author

@rangadi rangadi Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. I thought of commenting on that explicitly. Right now we don't depend on them to be consistent. But I agree, simpler to do the right thing.

Copy link
Contributor Author

@rangadi rangadi Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Please take a look.

@coveralls
Copy link

Coverage Status

Coverage increased (+1.0%) to 69.612% when pulling b4fc45f on rangadi:race_in_kafkaio into 6dd90d8 on apache:master.

@rangadi
Copy link
Contributor Author

rangadi commented Oct 16, 2017

@tgroh PTAL. I changed it to access them within same synchronized block.

@tgroh
Copy link
Member

tgroh commented Oct 17, 2017

Wait, sorry, I just looked at this again.

A Reader should only be interacted with by a single thread except in very specific circumstances. advance, getCurrent(), getWatermark, and getCurrentTimestamp must all be interacted with in a thread-safe manner (usually this means a single thread). The reader should not have to perform any synchronization within them.

Specifically, the UnboundedReader javadoc states:

  /**
   * A {@code Reader} that reads an unbounded amount of input.
   *
   * <p>A given {@code UnboundedReader} object will only be accessed by a single thread at once.
   */

The synchronization is as such unnecessary

@rangadi
Copy link
Contributor Author

rangadi commented Oct 17, 2017

That was my impression too.. that's we never synchronized around getWatermark(). Good that it is in the contract as well.

Looks like Flink does call getWatermak() concurrently. cc @aljoscha

@rangadi
Copy link
Contributor Author

rangadi commented Oct 19, 2017

@wtanaka, It looks like the runner is not supposed to call getWatermark() concurrently with other calls in the source.
@iemejia @aljoscha, could you comment on this concurrent access in Flink runner? If we need this fix as a short term work around, I could probably merge it with a comment. I don't see much of performance hit.

Source API explicitly mentions single threaded access : https://github.com/apache/beam/blob/release-2.0.0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Source.java#L131
getWatermark() documentation does not refer to this explicitly, but might be implicitly expected.

@aljoscha
Copy link
Contributor

Yes, this is a FlinkRunner problem. In UnboundedSourceWrapper the lock scope is not big enough, we synchronise in emitElement() but should instead synchronise inside the reader loop in run().

Raghu Angadi added 2 commits October 20, 2017 10:46
Two fixes :
 - Don't set curRecord to null before updating. If user deserializers
   throw, ok to keep curRecord pointing to old one.
 - use atomic references for curRecord and curTimestamp since
   these are accessed in getWatermark() and getCurrentTimestamp().
   These methods could be called concurrently with advance().
@rangadi
Copy link
Contributor Author

rangadi commented Oct 20, 2017

Thanks @aljoscha. Will be nice to fix it.
@tgroh : Updated the the fix to remove synchronization. Limited this PR to fixing first issue mentioned in the description: Avoid setting curRecord to null. It should fix the user issue for all practical purposes.

@rangadi
Copy link
Contributor Author

rangadi commented Oct 24, 2017

+R: @chamikaramj, @tgroh

@chamikaramj
Copy link
Contributor

LGTM

@chamikaramj
Copy link
Contributor

Run Java Precommit

1 similar comment
@chamikaramj
Copy link
Contributor

Run Java Precommit

@reuvenlax
Copy link
Contributor

tests still failing?

@chamikaramj
Copy link
Contributor

Run Java Precommit

@rangadi
Copy link
Contributor Author

rangadi commented Oct 30, 2017

Run Java Precommit.

@rangadi
Copy link
Contributor Author

rangadi commented Oct 30, 2017

@chamikaramj, how many attempts should we attempt ;-)

@chamikaramj
Copy link
Contributor

Ok, I give up. Will test locally and commit :)

@asfgit asfgit closed this in aa26f4b Oct 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants