
STORM-3162: Fix concurrent modification bug #2800

Closed
wants to merge 3 commits

Conversation

zd-project
Contributor

See: https://issues.apache.org/jira/browse/STORM-3162

I managed to alter the Java implementation and it seems to be passing the storm-server tests. However, in storm-core there's a nimbus_test.clj which depends on some of the older implementation that I changed in this PR. I'm not very familiar with Clojure myself, so I don't know how to fix it. If any of you can take a look, that would be great.

Contributor

@srdo srdo left a comment


Nice find. I'll be happy to help update the clojure code.

It's unrelated to this, but I'm wondering if we should replace the Map<String, Object> heartbeat model with a real class. It looks to me like it's currently a map with some magic strings in it.
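
For illustration, a hypothetical sketch of such a class (field names derived from the magic strings quoted elsewhere in this thread: "nimbus-time", "executor-reported-time", "is-timed-out"; this is not proposed Storm code):

```java
// Hypothetical replacement for the Map<String, Object> heartbeat model.
public class ExecutorBeat {
    private Integer nimbusTime;           // was the "nimbus-time" key
    private Integer executorReportedTime; // was the "executor-reported-time" key
    private boolean timedOut;             // was the "is-timed-out" key
    // constructor and accessors omitted for brevity
}
```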

if (cache == null) {
if (executorBeats == null) {
Contributor


The construction here seems a little odd. Initializing cache and executorBeats when null isn't necessary. I think we should keep the "if cache and executorBeats are null" clause, then replace the other two branches with something like Map<String, Object> currBeat = cache == null ? null : cache.get(executor); and equivalent for newBeat.
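
A minimal sketch of that shape, inside updateHeartbeatCache (loop and names taken from the snippets in this thread; the exact signature is an assumption):

```java
// Sketch only: keep the single combined null check, then let each branch
// tolerate a null map instead of eagerly initializing it.
if (cache == null && executorBeats == null) {
    return;
}
for (List<Integer> executor : executors) {
    Map<String, Object> currBeat = cache == null ? null : cache.get(executor);
    Map<String, Object> newBeat = executorBeats == null ? null : executorBeats.get(executor);
    // merge as before, e.g. via updateExecutorCache(currBeat, newBeat, timeout)
}
```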


Sorry, I don't see any reason why this method is not thread safe, because it is almost a utility method: it only initializes a Map cache which is then swapped into the Nimbus heartbeatsCache through heartbeatsCache.getAndUpdate(new Assoc<>(topoId, cache)). ConcurrentModificationException happens when we iterate over a collection through an iterator while also modifying it, but here we only iterate over the executor list and do not modify any of its entries.

Contributor Author


The very same map created here will be used in updateHeartbeatCache, where it may be modified concurrently. Hope this answers your question.


Concurrently modifying a HashMap is OK as long as we are not also iterating over it; for heartbeat updates we only need eventual consistency.

//else refresh nimbus-time and executor-reported-time by heartbeats reporting
for (List<Integer> executor : executors) {
cache.put(executor, updateExecutorCache(cache.get(executor), executorBeats.get(executor), timeout));
} else {
Contributor


Nit: an explicit return can help decrease indentation; I liked it better before.

heartbeatsCache.getAndUpdate(new Assoc<>(topoId, cache));
StatsUtil.convertExecutorBeats(stormClusterState.executorBeats(topoId, existingAssignment.get_executor_node_port()));
heartbeatsCache.compute(topoId, (k, v) ->
//Guaranteed side-effect-free
Contributor


We should probably put this requirement in comments on the two methods in StatsUtil rather than here, so they stay side effect free.
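
For example, a note along these lines on the two StatsUtil methods (illustrative wording, not the actual Storm Javadoc):

```java
/**
 * NOTE: This method must remain free of side effects on shared state,
 * because Nimbus invokes it from inside ConcurrentHashMap.compute(...),
 * where the remapping function may be applied under a lock.
 */
```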

@srdo
Contributor

srdo commented Aug 16, 2018

Fixed the nimbus_test zd-project#1

if (executorBeats == null) {
for (Map.Entry<List<Integer>, Map<String, Object>> executorbeat : cache.entrySet()) {
Map<String, Object> beat = executorbeat.getValue();
//If not executor beats, refresh is-timed-out of the cache which is done by master


This code branch can only be invoked by Nimbus, and it is always a single-threaded modification, so please double-check whether it can actually throw a ConcurrentModificationException.

Contributor Author


I believe this is wrapped in another method exposed in the Thrift API; see sendSupervisorWorkerHeartbeat.

Contributor Author


This is actually where the ConcurrentModificationException is thrown. Notice that the old code invokes both cache.entrySet() and cache.put() in this method. Since it's exposed through Thrift, a ConcurrentModificationException is possible. Also see the Travis log here for an example: https://travis-ci.org/apache/storm/jobs/408719153#L1897


@danny0405 danny0405 Aug 25, 2018


Then please check the code invocation when the passed-in executorBeats == null: for sendSupervisorWorkerHeartbeat we will never get a null, but at least an empty map. The code comment already addresses that.

As for the test, I believe there is some bug to fix, but this code modification is not really necessary.

Actually, I have used the 2.0 version of Storm on our 30-node cluster for at least 3 months and I have never seen a ConcurrentModificationException.

@@ -490,7 +493,7 @@ public Nimbus(Map<String, Object> conf, INimbus inimbus, IStormClusterState stor
         stormClusterState = makeStormClusterState(conf);
     }
     this.stormClusterState = stormClusterState;
-    this.heartbeatsCache = new AtomicReference<>(new HashMap<>());
+    this.heartbeatsCache = new ConcurrentHashMap<>();


Please change back to AtomicReference because of multi-thread visibility: the Thrift server actually serves the RPC methods on multiple threads, so modifications to heartbeatsCache should be made as visible as possible.

@danny0405

Please verify again why the ConcurrentModificationException happens and attach the stack trace.

@srdo
Contributor

srdo commented Aug 25, 2018

I'm sorry if this gets a bit verbose, but I'm going to write down what I see as the issue here, so we can hopefully come to a common understanding (and so I don't forget and have to look at this again).

As far as I can tell, the uses of heartbeatsCache in Nimbus are thread safe, because the values are never modified, just overwritten. That is, we don't do heartbeatsCache.get(topoId).put(foo, bar), instead we do heartbeatsCache.getAndUpdate(func), which replaces the value entirely. I don't believe we need further synchronization here, since the AtomicReference ensures that the value changes are propagated to all threads, and two threads reading from an effectively immutable map at the same time should be fine(?)

However in the updateHeartbeatCache method in StatsUtil

public static void updateHeartbeatCache(Map<List<Integer>, Map<String, Object>> cache,
we take one of the values from heartbeatsCache and modify it.

There are a couple of problems here. First, the cache value is a regular HashMap and not a ConcurrentHashMap, so modifying it from two threads at once isn't safe. Second, in the branch in updateHeartbeatCache where executorBeats is null, we iterate over the cache parameter. If one thread is in the iteration, and another thread is in the other branch in updateHeartbeatCache, we get the exception.

The reason this exception isn't thrown in a real cluster is that the executorBeats parameter is only null when called from

StatsUtil.updateHeartbeatCache(heartbeatsCache.get().get(topoId),

This only happens when Nimbus is booting up as part of launchServer, or when someone triggers a rebalance in the topology. We see it in the tests, because Nimbus and the supervisors are started concurrently, so Nimbus can be in one branch in StatsUtil.updateHeartbeatCache while one of the supervisors is in the other branch. It can technically happen in a real cluster, but someone would have to get extremely unlucky with rebalance timing.

I think the fix here should be making sure that StatsUtil.updateHeartbeatCache is thread safe. One option is to make the cache value a ConcurrentHashMap. Another option would be to make updateHeartbeatCache create and return a new map, instead of modifying the existing one.
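
A rough sketch of the second option (signature, parameter names, and helpers approximated from the snippets quoted in this thread, not the actual Storm code):

```java
public static Map<List<Integer>, Map<String, Object>> updateHeartbeatCache(
        Map<List<Integer>, Map<String, Object>> cache,
        Map<List<Integer>, Map<String, Object>> executorBeats,
        Set<List<Integer>> executors,
        Integer timeout) {
    // Work on a private copy so concurrent callers never mutate a shared map.
    Map<List<Integer>, Map<String, Object>> newCache =
            cache == null ? new HashMap<>() : new HashMap<>(cache);
    if (executorBeats == null) {
        // Master path: only refresh the is-timed-out flags.
        for (Map.Entry<List<Integer>, Map<String, Object>> e : newCache.entrySet()) {
            Map<String, Object> beat = new HashMap<>(e.getValue());
            beat.put("is-timed-out", Time.deltaSecs((Integer) beat.get("nimbus-time")) >= timeout);
            e.setValue(beat);
        }
    } else {
        // Worker path: refresh nimbus-time and executor-reported-time.
        for (List<Integer> executor : executors) {
            newCache.put(executor,
                    updateExecutorCache(newCache.get(executor), executorBeats.get(executor), timeout));
        }
    }
    return newCache; // the caller publishes this atomically
}
```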

@danny0405

@srdo @zd-project
Thanks for your explanation, that makes sense to me.

One thing to clarify: the executorBeats parameter of StatsUtil#updateHeartbeatCache is null on every scheduling round of the master, in order to refresh the is-timed-out flag.

There is indeed a possibility that a supervisor/worker walks into this code branch:

//else refresh nimbus-time and executor-reported-time by heartbeats reporting
for (List<Integer> executor : executors) {
    cache.put(executor, updateExecutorCache(cache.get(executor), executorBeats.get(executor), timeout));
}

and Nimbus the other:

//if not executor beats, refresh is-timed-out of the cache which is done by master
if (executorBeats == null) {
    for (Map.Entry<List<Integer>, Map<String, Object>> executorbeat : cache.entrySet()) {
        Map<String, Object> beat = executorbeat.getValue();
        beat.put("is-timed-out", Time.deltaSecs((Integer) beat.get("nimbus-time")) >= timeout);
    }
    return;

I think the key here is that we used a for-each iteration over the cache. We could change it to an iterator loop, which is okay because we only need eventual consistency, instead of using a ConcurrentMap or a copy, which would cause a performance regression.

@srdo
Contributor

srdo commented Aug 26, 2018

I don't think fixing the executorBeats == null branch is enough. As far as I can tell, two supervisors/workers can be in the

//else refresh nimbus-time and executor-reported-time by heartbeats reporting
for (List<Integer> executor : executors) {
    cache.put(executor, updateExecutorCache(cache.get(executor), executorBeats.get(executor), timeout));
}
branch at the same time for the same topology. We won't get an exception if this happens, but we'll still be modifying a HashMap from two threads at the same time, which isn't safe.

Regarding fixing the executorBeats == null branch, it isn't enough to switch to an explicit iterator, since iterators have the same behavior as a for-each loop (they throw an exception if the underlying collection is concurrently modified).
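
For reference, a for-each loop over a map is just sugar for an explicit iterator, so both forms fail identically (a minimal illustration, not Storm code):

```java
// Both loops throw ConcurrentModificationException if another thread
// structurally modifies 'cache' while the iteration is in progress.
for (Map.Entry<List<Integer>, Map<String, Object>> e : cache.entrySet()) {
    // ... read or update e ...
}

Iterator<Map.Entry<List<Integer>, Map<String, Object>>> it = cache.entrySet().iterator();
while (it.hasNext()) {
    Map.Entry<List<Integer>, Map<String, Object>> e = it.next();
    // ... same behavior: next() fail-fasts on concurrent modification ...
}
```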

@srdo
Contributor

srdo commented Aug 26, 2018

Regarding performance, consider that Nimbus is already copying heartbeatCache on writes everywhere else https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/daemon/nimbus/Nimbus.java#L4636-L4639.

Changing StatsUtil.updateHeartbeatCache to return a new Map and using Assoc to update heartbeatCache would be my preferred solution.
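
A minimal sketch of that flow, assuming updateHeartbeatCache is changed to return a fresh map (the Assoc updater is the one already used at the getAndUpdate call site quoted earlier; allExecutors and timeout are assumed parameter names):

```java
// Copy-on-write update: build a new per-topology map, then publish it
// atomically; readers keep seeing the old, effectively immutable map.
Map<List<Integer>, Map<String, Object>> updated =
        StatsUtil.updateHeartbeatCache(heartbeatsCache.get().get(topoId),
                                       executorBeats, allExecutors, timeout);
heartbeatsCache.getAndUpdate(new Assoc<>(topoId, updated));
```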

@danny0405

danny0405 commented Aug 26, 2018

@srdo
Agreed that a copy is the better solution.

Because we only need eventual consistency, fixing the executorBeats == null branch would be enough, but I also agree with keeping the other branch thread safe so that it doesn't confuse anyone.

@srdo
Contributor

srdo commented Aug 26, 2018

@danny0405 Could you elaborate on why fixing the executorBeats == null branch is enough? My concern is that the other branch modifies a HashMap (the cache parameter) from multiple threads with no synchronization. Why is this safe?

@danny0405

@srdo
Because we only modify it from multiple threads but do not iterate over it, and the cache key is the executor id, which will only conflict between master and supervisor.

@srdo
Contributor

srdo commented Aug 27, 2018

@danny0405 That doesn't sound safe to me. I think you're right that it works fine most of the time, but if there are key collisions, or an insert causes the map to be resized, two threads modifying the map at the same time could interfere with each other.

Either way, if you're okay with making the whole function thread safe, I think we should do it.

@srdo
Contributor

srdo commented Sep 11, 2018

@zd-project I'd like to finish this up. Let me know if you want to make the last couple of fixes, otherwise I'll open a new PR containing this fix.

@danny0405 I thought about it a bit more, and while I still think we can fix this by making updateHeartbeatCache thread safe by making it return a new map and keeping the pre-this-PR AtomicReference in Nimbus, I'm not sure why this would be faster than just using a ConcurrentHashMap like the current PR code here does? Using the AtomicReference in Nimbus essentially makes the heartbeat cache a copy-on-write Map due to the way we do updates via Assoc and Dissoc. I would expect a ConcurrentHashMap to provide better parallelism. What do you think?
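
For context, the ConcurrentHashMap approach in this PR boils down to something like the following, based on the compute call quoted earlier (the lambda body is an assumption, and it presumes the StatsUtil method is changed to return the updated map):

```java
// compute() applies the remapping function atomically for the key, which is
// why the function must be side-effect-free: it may run under a lock and
// can't safely touch shared state.
heartbeatsCache.compute(topoId, (k, v) ->
        StatsUtil.updateHeartbeatCache(v, executorBeats, allExecutors, timeout));
```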

@zd-project
Contributor Author

Agreed. I think the AtomicReference is really just there for Clojure compatibility. I'll finish this up.

@danny0405

danny0405 commented Sep 12, 2018

@srdo
I think the only difference is that, compared to AtomicReference, ConcurrentHashMap keeps the map thread safe but cannot ensure that the value read is up to date, which can cause some inconsistent behavior: the scheduling thread reads it every 10s, and with an out-of-date heartbeat cache the master may kill a healthy worker or restart one that has already started.

But ConcurrentHashMap has finer-grained locking, is less susceptible to resource contention, and is likely to be the more performant of the two.

@srdo
Contributor

srdo commented Sep 12, 2018

@danny0405 Thanks for explaining. I'm not sure I understand why the scheduling thread would see older values with ConcurrentHashMap than with AtomicReference? It was my understanding that ConcurrentHashMap has the same happens-before guarantees as volatile variables for reads/writes?

@danny0405

danny0405 commented Sep 13, 2018

@srdo
For thread visibility, I think the two are the same.

I think ConcurrentHashMap is a better choice for performance.

@revans2
Contributor

revans2 commented Sep 15, 2018

Please take a look at #2836 as an alternative. It is a much bigger patch, but I think the refactoring it does will make things much easier long term. Having done the other patch, I think this one does fix the immediate issue, but the current HB cache is so complex that I didn't feel I understood all of the ways it was accessed until I had done the other patch.

STORM-3162: Fix nimbus_test
@zd-project
Contributor Author

As far as my understanding of the code goes, I believe this fix should resolve this particular issue. However, my understanding of the HB cache was limited, and the PR was pushed hastily during the last few days of my internship. I definitely support switching to the alternative if it actually solves the deeper structural issue.

Please let me know if there's anything else I can help with.

@srdo srdo closed this Apr 15, 2019