STORM-3162: Fix concurrent modification bug #2800
Nice find. I'll be happy to help update the Clojure code.
It's unrelated to this, but I'm wondering if we should replace the Map<String, Object> heartbeat model with a real class. It looks to me like it's currently a map with some magic strings in it.
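As a rough illustration of what a "real class" could look like, here is a minimal sketch. The class name, field names, and key strings are all assumptions made for this example (the keys are loosely based on comments quoted elsewhere in this review); it is not the class Storm actually uses.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical typed heartbeat, replacing a Map<String, Object> with magic keys.
 * Field names and map keys are assumptions for illustration only.
 */
final class ExecutorBeatSketch {
    // Assumed key names, loosely based on the comments quoted in this review.
    static final String NIMBUS_TIME = "nimbus-time";
    static final String REPORTED_TIME = "executor-reported-time";
    static final String TIMED_OUT = "is-timed-out";

    private final int nimbusTimeSecs;
    private final int reportedTimeSecs;
    private final boolean timedOut;

    ExecutorBeatSketch(int nimbusTimeSecs, int reportedTimeSecs, boolean timedOut) {
        this.nimbusTimeSecs = nimbusTimeSecs;
        this.reportedTimeSecs = reportedTimeSecs;
        this.timedOut = timedOut;
    }

    int getNimbusTimeSecs() { return nimbusTimeSecs; }
    int getReportedTimeSecs() { return reportedTimeSecs; }
    boolean isTimedOut() { return timedOut; }

    /** Bridges from the legacy map form; throws if a key is missing or mistyped. */
    static ExecutorBeatSketch fromMap(Map<String, Object> beat) {
        return new ExecutorBeatSketch(
            (Integer) beat.get(NIMBUS_TIME),
            (Integer) beat.get(REPORTED_TIME),
            (Boolean) beat.get(TIMED_OUT));
    }

    /** Converts back to the legacy map form for callers that still expect it. */
    Map<String, Object> toMap() {
        Map<String, Object> beat = new HashMap<>();
        beat.put(NIMBUS_TIME, nimbusTimeSecs);
        beat.put(REPORTED_TIME, reportedTimeSecs);
        beat.put(TIMED_OUT, timedOut);
        return beat;
    }
}
```

Because the fields are final, instances of such a class can be shared between threads without further synchronization, which is one argument for moving away from a mutable map.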
if (cache == null) {
if (executorBeats == null) {
The construction here seems a little odd. Initializing cache and executorBeats when null isn't necessary. I think we should keep the "if cache and executorBeats are null" clause, then replace the other two branches with something like Map<String, Object> currBeat = cache == null ? null : cache.get(executor); and the equivalent for newBeat.
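The suggested shape could be sketched roughly as follows. The class and helper names are hypothetical, `mergeBeat` is only a placeholder standing in for `updateExecutorCache`, and the "both null" clause is left out because its body is not shown in this review.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BeatLookupSketch {
    /** Null-safe lookup: returns null when the map itself is null. */
    static Map<String, Object> beatFor(Map<List<Integer>, Map<String, Object>> beats,
                                       List<Integer> executor) {
        return beats == null ? null : beats.get(executor);
    }

    /** Placeholder for updateExecutorCache: prefers the new beat over the cached one. */
    static Map<String, Object> mergeBeat(Map<String, Object> currBeat,
                                         Map<String, Object> newBeat) {
        return newBeat != null ? newBeat : currBeat;
    }

    /** Builds a fresh cache without first initializing null inputs to empty maps. */
    static Map<List<Integer>, Map<String, Object>> refresh(
            Map<List<Integer>, Map<String, Object>> cache,
            Map<List<Integer>, Map<String, Object>> executorBeats,
            List<List<Integer>> executors) {
        Map<List<Integer>, Map<String, Object>> result = new HashMap<>();
        for (List<Integer> executor : executors) {
            Map<String, Object> currBeat = beatFor(cache, executor);
            Map<String, Object> newBeat = beatFor(executorBeats, executor);
            result.put(executor, mergeBeat(currBeat, newBeat));
        }
        return result;
    }
}
```

The point of the ternary is that neither input has to be replaced with an empty map just to make the lookups safe.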
Sorry, I don't see any reason why this method is not thread safe. It is essentially a utility method that only initializes a Map cache, which is then stored in Nimbus's heartbeatsCache through heartbeatsCache.getAndUpdate(new Assoc<>(topoId, cache)). A ConcurrentModificationException happens when we iterate over a collection through an iterator and also modify it, but here we only iterate over the executor list and do not modify any of its entries.
The very same map created here is later used in updateHeartbeatCache, where it may be modified concurrently. I hope this answers your question.
Concurrently modifying a HashMap is OK as long as we are not also iterating over it. For heartbeat updates, we only need eventual consistency.
//else refresh nimbus-time and executor-reported-time by heartbeats reporting
for (List<Integer> executor : executors) {
    cache.put(executor, updateExecutorCache(cache.get(executor), executorBeats.get(executor), timeout));
} else {
Nit: Explicit return can help decrease indentation, I liked it better before.
heartbeatsCache.getAndUpdate(new Assoc<>(topoId, cache));
StatsUtil.convertExecutorBeats(stormClusterState.executorBeats(topoId, existingAssignment.get_executor_node_port()));
heartbeatsCache.compute(topoId, (k, v) ->
//Guaranteed side-effect-free
We should probably put this requirement in comments on the two methods in StatsUtil rather than here, so they stay side effect free.
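To make the "side-effect-free" requirement concrete, here is a minimal sketch of a per-topology update through ConcurrentHashMap.compute, with the requirement documented on the helper itself. The class, method names, and value types are hypothetical and much simpler than the real heartbeat cache.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class HeartbeatCacheSketch {
    // Hypothetical shape: topology id -> (executor id -> last report time).
    private final ConcurrentHashMap<String, Map<String, Integer>> cache =
        new ConcurrentHashMap<>();

    /**
     * Must stay side-effect-free: it never mutates {@code old} and touches no
     * shared state, so it is safe to call from inside compute().
     */
    static Map<String, Integer> refreshed(Map<String, Integer> old, String executor, int time) {
        Map<String, Integer> copy = old == null ? new HashMap<>() : new HashMap<>(old);
        copy.put(executor, time);
        return Collections.unmodifiableMap(copy);
    }

    /** Atomically replaces the entry for one topology. */
    void record(String topoId, String executor, int time) {
        cache.compute(topoId, (k, v) -> refreshed(v, executor, time));
    }

    Map<String, Integer> get(String topoId) {
        return cache.get(topoId);
    }
}
```

Keeping the contract in the helper's Javadoc means it travels with the method, rather than living only at one call site.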
Fixed the nimbus_test zd-project#1
if (executorBeats == null) {
    for (Map.Entry<List<Integer>, Map<String, Object>> executorbeat : cache.entrySet()) {
        Map<String, Object> beat = executorbeat.getValue();
        //If not executor beats, refresh is-timed-out of the cache which is done by master
This code branch can only be invoked by Nimbus, and it is always modified from a single thread, so please double-check whether it can actually throw a ConcurrentModificationException.
I believe this is wrapped in another method exposed in the Thrift API; see sendSupervisorWorkerHeartbeat.
This is actually where the ConcurrentModificationException is thrown. Notice that the old code invokes both cache.entrySet() and cache.put() in this method. Since it's exposed through Thrift, it's possible to get a ConcurrentModificationException. Also see the Travis log here for an example: https://travis-ci.org/apache/storm/jobs/408719153#L1897
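The failure mode described here, iterating over entrySet() while a put() adds a new key, can be reproduced deterministically in a single thread. The sketch below uses hypothetical names and is only meant to show the mechanism; in Nimbus the iteration and the put would come from different Thrift threads.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IterateAndPutSketch {
    /** Fills the given map with three entries so the iterator has more than one step. */
    static Map<Integer, String> seed(Map<Integer, String> map) {
        map.put(1, "a");
        map.put(2, "b");
        map.put(3, "c");
        return map;
    }

    /** Returns true if inserting a new key during iteration triggers the fail-fast check. */
    static boolean throwsCme(Map<Integer, String> map) {
        try {
            for (Map.Entry<Integer, String> entry : map.entrySet()) {
                // Adding a key that is not yet present is a structural modification.
                map.put(-1, "inserted while iterating");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap: " + throwsCme(seed(new HashMap<>())));
        System.out.println("ConcurrentHashMap: " + throwsCme(seed(new ConcurrentHashMap<>())));
    }
}
```

HashMap iterators are fail-fast, so the first call reports an exception, while ConcurrentHashMap iterators are weakly consistent and never throw it. Note that overwriting an existing key in a HashMap is not a structural modification, which may be part of why the problem rarely shows up in practice.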
Then please check the invocation path when the passed-in executorBeats == null. For sendSupervisorWorkerHeartbeat we will never get a null, but at least an empty map. The code comment already addresses that.
For testing, I believe there may be some bug to fix, but this code modification is not really necessary.
Actually, I have been running Storm 2.0 on our 30-node cluster for at least 3 months and I have never gotten a ConcurrentModificationException.
@@ -490,7 +493,7 @@ public Nimbus(Map<String, Object> conf, INimbus inimbus, IStormClusterState stor
             stormClusterState = makeStormClusterState(conf);
         }
         this.stormClusterState = stormClusterState;
-        this.heartbeatsCache = new AtomicReference<>(new HashMap<>());
+        this.heartbeatsCache = new ConcurrentHashMap<>();
Please change this back to AtomicReference, because it is visible to multiple threads. The Thrift server serves the RPC methods with multiple threads, so we should make modifications to heartbeatsCache visible as promptly as possible.
Please make sure again why the |
I'm sorry if this turns a bit verbose, but I'm going to write down what I see as the issue here, so we can hopefully come to a common understanding (and so I don't forget and have to look at this again) As far as I can tell, the uses of However in the
heartbeatsCache and modify it.
There are a couple of problems here. First, the The reason this exception isn't thrown in a real cluster is that the
This only happens when Nimbus is booting up as part of I think the fix here should be making sure that |
@srdo @zd-project One thing needs to clarify is that There does have possibility that supervisor/worker will walk into code branch: storm/storm-client/src/jvm/org/apache/storm/stats/StatsUtil.java Lines 1576 to 1579 in 4c42ee3
and Nimbus the other: storm/storm-client/src/jvm/org/apache/storm/stats/StatsUtil.java Lines 1568 to 1574 in 4c42ee3
I think the key here is we used a |
I don't think fixing the storm/storm-client/src/jvm/org/apache/storm/stats/StatsUtil.java Lines 1576 to 1579 in 4c42ee3
Regarding fixing the |
Regarding performance, consider that Nimbus is already copying Changing |
@srdo Cause we only need a final consistency, fix the |
@danny0405 Could you elaborate on why fixing the |
@srdo |
@danny0405 That doesn't sound safe to me. I think you're right that it works fine most of the time, but if there are key collisions or an insert leads the map to get resized, I would think that two threads modifying the map at the same time could interfere with each other. Either way, if you're okay with making the whole function thread safe, I think we should do it.
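For comparison, here is a small sketch of several threads inserting disjoint keys into a shared map. The class and method names are hypothetical. With a ConcurrentHashMap every insert is retained; with a plain HashMap the outcome is undefined, because concurrent puts can collide in a bucket or race with a resize, so no particular result is claimed for that case.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrentPutSketch {
    /** Starts the given number of threads, each inserting its own key range, and returns the final size. */
    static int fillConcurrently(Map<Integer, Integer> map, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread; // disjoint key range per thread
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // join also makes the workers' writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return map.size();
    }

    public static void main(String[] args) {
        // Thread-safe map: all 4 * 1000 inserts are retained.
        System.out.println(fillConcurrently(new ConcurrentHashMap<>(), 4, 1000));
    }
}
```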
@zd-project I'd like to finish this up. Let me know if you want to make the last couple of fixes, otherwise I'll open a new PR containing this fix. @danny0405 I thought about it a bit more, and while I still think we can fix this by making |
Agreed. I think the atomic reference is really just there for Clojure compatibility. I'll finish this up.
@srdo But the |
@danny0405 Thanks for explaining. I'm not sure I understand why the scheduling thread will see older values with |
@srdo I think |
Please take a look at #2836 as an alternative. It is a much bigger patch, but I think the refactoring it does will make things much easier longer term. Having done the other patch I think this one does fix the immediate issue, but the current HB cache is so complex that I didn't feel like I understood all of the ways the cache was accessed before doing the other patch.
STORM-3162: Fix nimbus_test
As far as I understand the code, I believe this fix should resolve this particular issue. However, my understanding of the HB cache was limited, and the PR was pushed hastily during the last few days of my internship. I definitely support switching to the alternative if it actually solves the deeper structural issue. Please let me know if there's anything else I can help with.
See: https://issues.apache.org/jira/browse/STORM-3162
I managed to alter the Java implementation, and it seems to pass the storm-server tests. However, in storm-core there's a nimbus_test.clj which depends on some of the older implementation that I changed in this PR. I'm not very familiar with Clojure, so I don't know how to fix it. If any of you can take a look, that would be great.