
Fix coordinator loadStatus performance #5632

Merged
merged 6 commits into apache:master on Apr 12, 2018

Conversation

@jon-wei (Contributor) commented Apr 11, 2018

getLoadStatus() in DruidCoordinator determines how many segments still need to be loaded per datasource: it retrieves the set of all segments from the metadata store, then asks each server for its segment inventory and removes that server's segments from the set returned from metadata:

      // remove loaded segments
      for (DruidServer druidServer : serverInventoryView.getInventory()) {
        final DruidDataSource loadedView = druidServer.getDataSource(dataSource.getName());
        if (loadedView != null) {
          segments.removeAll(loadedView.getSegments());
        }
      }

The current code can perform badly when a server has the entire set of segments for a datasource.

This is the code for AbstractSet.removeAll():

    public boolean removeAll(Collection<?> c) {
        Objects.requireNonNull(c);
        boolean modified = false;

        if (size() > c.size()) {
            for (Iterator<?> i = c.iterator(); i.hasNext(); )
                modified |= remove(i.next());
        } else {
            for (Iterator<?> i = iterator(); i.hasNext(); ) {
                if (c.contains(i.next())) {
                    i.remove();
                    modified = true;
                }
            }
        }
        return modified;
    }

In that situation the set is no larger than the server's collection, so the else branch runs. That is a problem because loadedView.getSegments() is the values view of a ConcurrentHashMap, and the view's contains() does a full traversal of the map, making removeAll effectively quadratic:

  public Collection<DataSegment> getSegments()
  {
    return Collections.unmodifiableCollection(idToSegmentMap.values());
  }

...
    public static <T> Collection<T> unmodifiableCollection(Collection<? extends T> c) {
        return new UnmodifiableCollection<>(c);
    }
    public boolean contains(Object o)   {return c.contains(o);}
...
    static final class ValuesView<K,V> extends CollectionView<K,V,V>
        implements Collection<V>, java.io.Serializable {

        ValuesView(ConcurrentHashMap<K,V> map) { super(map); }

        public final boolean contains(Object o) {
            return map.containsValue(o);
        }
...
    /**
     * Returns {@code true} if this map maps one or more keys to the
     * specified value. Note: This method may require a full traversal
     * of the map, and is much slower than method {@code containsKey}.
     *
     * @param value value whose presence in this map is to be tested
     * @return {@code true} if this map maps one or more keys to the
     *         specified value
     * @throws NullPointerException if the specified value is null
     */
    public boolean containsValue(Object value) {
        if (value == null)
            throw new NullPointerException();
        Node<K,V>[] t;
        if ((t = table) != null) {
            Traverser<K,V> it = new Traverser<K,V>(t, t.length, 0, t.length);
            for (Node<K,V> p; (p = it.advance()) != null; ) {
                V v;
                if ((v = p.val) == value || (v != null && value.equals(v)))
                    return true;
            }
        }
        return false;
    }
...

The following benchmark (included in the PR) shows this behavior:

Benchmark                       (serverHasAllSegments)  (totalSegmentsCount)  Mode  Cnt        Score       Error  Units
LoadStatusBenchmark.newVersion                    true                 10000  avgt   50      387.188 ±     4.991  us/op
LoadStatusBenchmark.newVersion                   false                 10000  avgt   50      227.884 ±     2.444  us/op
LoadStatusBenchmark.oldVersion                    true                 10000  avgt   50  1581777.362 ± 38030.835  us/op
LoadStatusBenchmark.oldVersion                   false                 10000  avgt   50      244.024 ±     2.923  us/op
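
For reference, here is a minimal standalone JMH sketch of how such a comparison can be set up. This is not the PR's actual LoadStatusBenchmark: String IDs stand in for DataSegment, the class and field names are illustrative, and only the serverHasAllSegments=true case is modeled.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Level;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class RemoveAllSketchBenchmark
    {
      @Param({"10000"})
      int totalSegmentsCount;

      private ConcurrentHashMap<String, String> serverSegments; // stands in for the server's idToSegmentMap
      private Set<String> metadataSegments;                     // stands in for the metadata segment set

      // Rebuild the collections before every invocation because the benchmark methods mutate the set.
      @Setup(Level.Invocation)
      public void setup()
      {
        serverSegments = new ConcurrentHashMap<>();
        metadataSegments = new HashSet<>();
        for (int i = 0; i < totalSegmentsCount; i++) {
          String id = "segment_" + i;
          serverSegments.put(id, id);
          metadataSegments.add(id);
        }
      }

      @Benchmark
      public void oldVersion(Blackhole blackhole)
      {
        // removeAll takes the contains() branch (the argument is not smaller than the set),
        // and contains() on a ConcurrentHashMap values view traverses the whole map: O(n * m).
        metadataSegments.removeAll(serverSegments.values());
        blackhole.consume(metadataSegments);
      }

      @Benchmark
      public void newVersion(Blackhole blackhole)
      {
        // One O(1) HashSet.remove() per server segment: O(m) overall.
        for (String serverSegment : serverSegments.values()) {
          metadataSegments.remove(serverSegment);
        }
        blackhole.consume(metadataSegments);
      }
    }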

@gianm (Contributor) commented Apr 12, 2018

> The current code can perform badly when a server has the entire set of segments for a datasource.

This is a pretty neat failure mode!

> 1581777.362 ± 38030.835

And it sounds like "perform badly" is an understatement.

@@ -297,7 +297,9 @@ boolean hasLoadPending(final String dataSource)
       for (DruidServer druidServer : serverInventoryView.getInventory()) {
         final DruidDataSource loadedView = druidServer.getDataSource(dataSource.getName());
         if (loadedView != null) {
-          segments.removeAll(loadedView.getSegments());
+          for (DataSegment serverSegment : loadedView.getSegments()) {
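
The excerpt cuts off before the new loop's body. Based on the PR description and the newVersion benchmark, the replacement presumably removes each server segment from the set individually, roughly as in this sketch (the warning comment is illustrative of what the review below asks for, not the actual source):

          // Intentionally NOT segments.removeAll(loadedView.getSegments()):
          // contains() on the ConcurrentHashMap values view traverses the whole map,
          // which makes removeAll effectively quadratic when a server holds every segment.
          for (DataSegment serverSegment : loadedView.getSegments()) {
            segments.remove(serverSegment);
          }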
A reviewer (Contributor) commented on the new loop:

This deserves a comment so someone doesn't "simplify" it back into the old code.

jon-wei (Contributor, Author) replied:

Good point, added a comment

@gianm (Contributor) left a comment:

LGTM

@gianm (Contributor) commented Apr 12, 2018

@jon-wei is there anywhere else using removeAll that looks suspicious? (IMO, given what we see here, anything where the argument is not a Set is suspicious.)

@jon-wei (Contributor, Author) commented Apr 12, 2018

@gianm

KafkaSupervisor calls removeAll in checkPendingCompletionTasks, where both collections are ArrayLists containing task groups.

PendingTaskBasedWorkerProvisioningStrategy and SimpleWorkerProvisioningStrategy have instances where Set.removeAll is called with a List containing worker IDs. A general mitigation for call sites like these is sketched below.
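
For call sites like these, where removeAll is handed a List rather than a Set, the contains() probes inside removeAll degrade to linear scans. A minimal sketch of a hypothetical helper (not part of this PR or of Druid) that keeps those probes constant-time by copying a non-Set argument into a HashSet before delegating:

    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical utility: remove every element of 'items' from 'target' without hitting
    // a slow contains() path. ArrayList.removeAll probes the argument's contains() for every
    // element, and AbstractSet.removeAll does so whenever the set is not larger than the
    // argument, so a List or ConcurrentHashMap values view argument turns each probe into a scan.
    final class CollectionRemovals
    {
      static <T> boolean removeAllFast(Collection<T> target, Collection<? extends T> items)
      {
        if (items instanceof Set) {
          return target.removeAll(items);
        }
        // Copying a non-Set argument into a HashSet makes its contains() O(1).
        return target.removeAll(new HashSet<>(items));
      }
    }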

@fjy fjy closed this Apr 12, 2018
@fjy fjy reopened this Apr 12, 2018
@jon-wei jon-wei added this to the 0.12.1 milestone Apr 12, 2018
@jon-wei jon-wei merged commit e91add6 into apache:master Apr 12, 2018
jon-wei added a commit to jon-wei/druid that referenced this pull request Apr 12, 2018
* Optimize coordinator loadStatus

* Add comment

* Fix teamcity

* Checkstyle

* More checkstyle

* Checkstyle
jon-wei added a commit that referenced this pull request Apr 12, 2018

gianm pushed a commit to implydata/druid-public that referenced this pull request Apr 16, 2018

sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018

gianm pushed a commit to implydata/druid-public that referenced this pull request May 16, 2018
@dclim dclim modified the milestones: 0.12.1, 0.12.0 Oct 8, 2018