
Fix coordinator loadStatus performance #5632

Merged
merged 6 commits into apache:master on Apr 12, 2018

Conversation

@jon-wei (Contributor) commented Apr 11, 2018

getLoadStatus() in DruidCoordinator determines how many segments still need to be loaded per datasource: it retrieves the set of all segments from the metadata store, then asks each server for its segment inventory and removes that server's segments from the set returned from metadata:

      // remove loaded segments
      for (DruidServer druidServer : serverInventoryView.getInventory()) {
        final DruidDataSource loadedView = druidServer.getDataSource(dataSource.getName());
        if (loadedView != null) {
          segments.removeAll(loadedView.getSegments());
        }
      }

The current code can perform badly when a server has the entire set of segments for a datasource.

This is the code for AbstractSet.removeAll():

    public boolean removeAll(Collection<?> c) {
        Objects.requireNonNull(c);
        boolean modified = false;

        if (size() > c.size()) {
            for (Iterator<?> i = c.iterator(); i.hasNext(); )
                modified |= remove(i.next());
        } else {
            for (Iterator<?> i = iterator(); i.hasNext(); ) {
                if (c.contains(i.next())) {
                    i.remove();
                    modified = true;
                }
            }
        }
        return modified;
    }

In that situation the set is no larger than the server's collection, so the else branch runs. That is a problem because loadedView.getSegments() is the values view of a ConcurrentHashMap, and the view's contains() does a full traversal of the map, making removeAll effectively quadratic:

  public Collection<DataSegment> getSegments()
  {
    return Collections.unmodifiableCollection(idToSegmentMap.values());
  }

...
    public static <T> Collection<T> unmodifiableCollection(Collection<? extends T> c) {
        return new UnmodifiableCollection<>(c);
    }
    public boolean contains(Object o)   {return c.contains(o);}
...
    static final class ValuesView<K,V> extends CollectionView<K,V,V>
        implements Collection<V>, java.io.Serializable {

        ValuesView(ConcurrentHashMap<K,V> map) { super(map); }

        public final boolean contains(Object o) {
            return map.containsValue(o);
        }
...
    /**
     * Returns {@code true} if this map maps one or more keys to the
     * specified value. Note: This method may require a full traversal
     * of the map, and is much slower than method {@code containsKey}.
     *
     * @param value value whose presence in this map is to be tested
     * @return {@code true} if this map maps one or more keys to the
     *         specified value
     * @throws NullPointerException if the specified value is null
     */
    public boolean containsValue(Object value) {
        if (value == null)
            throw new NullPointerException();
        Node<K,V>[] t;
        if ((t = table) != null) {
            Traverser<K,V> it = new Traverser<K,V>(t, t.length, 0, t.length);
            for (Node<K,V> p; (p = it.advance()) != null; ) {
                V v;
                if ((v = p.val) == value || (v != null && value.equals(v)))
                    return true;
            }
        }
        return false;
    }
...

The following benchmark (included in the PR) shows this behavior:

Benchmark                       (serverHasAllSegments)  (totalSegmentsCount)  Mode  Cnt        Score       Error  Units
LoadStatusBenchmark.newVersion                    true                 10000  avgt   50      387.188 ±     4.991  us/op
LoadStatusBenchmark.newVersion                   false                 10000  avgt   50      227.884 ±     2.444  us/op
LoadStatusBenchmark.oldVersion                    true                 10000  avgt   50  1581777.362 ± 38030.835  us/op
LoadStatusBenchmark.oldVersion                   false                 10000  avgt   50      244.024 ±     2.923  us/op
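
For reference, here is a minimal standalone JMH sketch of how such a comparison can be set up. This is not the PR's actual LoadStatusBenchmark: String IDs stand in for DataSegment, the class and field names are illustrative, and only the serverHasAllSegments=true case is modeled.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Level;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class RemoveAllSketchBenchmark
    {
      @Param({"10000"})
      int totalSegmentsCount;

      private ConcurrentHashMap<String, String> serverSegments; // stands in for the server's idToSegmentMap
      private Set<String> metadataSegments;                     // stands in for the metadata segment set

      // Rebuild the collections before every invocation because the benchmark methods mutate the set.
      @Setup(Level.Invocation)
      public void setup()
      {
        serverSegments = new ConcurrentHashMap<>();
        metadataSegments = new HashSet<>();
        for (int i = 0; i < totalSegmentsCount; i++) {
          String id = "segment_" + i;
          serverSegments.put(id, id);
          metadataSegments.add(id);
        }
      }

      @Benchmark
      public void oldVersion(Blackhole blackhole)
      {
        // removeAll takes the contains() branch (the argument is not smaller than the set),
        // and contains() on a ConcurrentHashMap values view traverses the whole map: O(n * m).
        metadataSegments.removeAll(serverSegments.values());
        blackhole.consume(metadataSegments);
      }

      @Benchmark
      public void newVersion(Blackhole blackhole)
      {
        // One O(1) HashSet.remove() per server segment: O(m) overall.
        for (String serverSegment : serverSegments.values()) {
          metadataSegments.remove(serverSegment);
        }
        blackhole.consume(metadataSegments);
      }
    }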

@gianm (Contributor) commented Apr 12, 2018

> The current code can perform badly when a server has the entire set of segments for a datasource.

This is a pretty neat failure mode!

> 1581777.362 ± 38030.835

And it sounds like "perform badly" is an understatement.

@@ -297,7 +297,9 @@ boolean hasLoadPending(final String dataSource)
       for (DruidServer druidServer : serverInventoryView.getInventory()) {
         final DruidDataSource loadedView = druidServer.getDataSource(dataSource.getName());
         if (loadedView != null) {
-          segments.removeAll(loadedView.getSegments());
+          for (DataSegment serverSegment : loadedView.getSegments()) {
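
The excerpt cuts off before the new loop's body. Based on the PR description and the newVersion benchmark, the replacement presumably removes each server segment from the set individually, roughly as in this sketch (the warning comment is illustrative of what the review below asks for, not the actual source):

          // Intentionally NOT segments.removeAll(loadedView.getSegments()):
          // contains() on the ConcurrentHashMap values view traverses the whole map,
          // which makes removeAll effectively quadratic when a server holds every segment.
          for (DataSegment serverSegment : loadedView.getSegments()) {
            segments.remove(serverSegment);
          }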
A reviewer (Contributor) commented on the new loop:

This deserves a comment so someone doesn't "simplify" it back into the old code.

jon-wei (Contributor, Author) replied:

Good point, added a comment

@gianm (Contributor) left a comment:

LGTM

@gianm (Contributor) commented Apr 12, 2018

@jon-wei is there anywhere else using removeAll that looks suspicious? (IMO, given what we see here, anything where the argument is not a Set is suspicious.)

@jon-wei (Contributor, Author) commented Apr 12, 2018

@gianm

KafkaSupervisor calls removeAll in checkPendingCompletionTasks, where both collections are ArrayLists containing task groups.

PendingTaskBasedWorkerProvisioningStrategy and SimpleWorkerProvisioningStrategy have instances where Set.removeAll is called with a List containing worker IDs. A general mitigation for call sites like these is sketched below.
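
For call sites like these, where removeAll is handed a List rather than a Set, the contains() probes inside removeAll degrade to linear scans. A minimal sketch of a hypothetical helper (not part of this PR or of Druid) that keeps those probes constant-time by copying a non-Set argument into a HashSet before delegating:

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical utility: remove every element of 'items' from 'target' without hitting
    // a slow contains() path. ArrayList.removeAll probes the argument's contains() for every
    // element, and AbstractSet.removeAll does so whenever the set is not larger than the
    // argument, so a List or ConcurrentHashMap values view argument turns each probe into a scan.
    final class CollectionRemovals
    {
      static <T> boolean removeAllFast(Collection<T> target, Collection<? extends T> items)
      {
        if (items instanceof Set) {
          return target.removeAll(items);
        }
        // Copying a non-Set argument into a HashSet makes its contains() O(1).
        return target.removeAll(new HashSet<>(items));
      }
    }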

@fjy fjy closed this Apr 12, 2018
@fjy fjy reopened this Apr 12, 2018
@jon-wei jon-wei added this to the 0.12.1 milestone Apr 12, 2018
@jon-wei jon-wei merged commit e91add6 into apache:master Apr 12, 2018
jon-wei added a commit to jon-wei/druid that referenced this pull request Apr 12, 2018
* Optimize coordinator loadStatus

* Add comment

* Fix teamcity

* Checkstyle

* More checkstyle

* Checkstyle
jon-wei added a commit that referenced this pull request Apr 12, 2018

gianm pushed a commit to implydata/druid-public that referenced this pull request Apr 16, 2018

sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018

gianm pushed a commit to implydata/druid-public that referenced this pull request May 16, 2018
@dclim dclim modified the milestones: 0.12.1, 0.12.0 Oct 8, 2018