
STORM-2306 : Messaging subsystem redesign. #2241

Closed
wants to merge 34 commits into from

Conversation

roshannaik
Contributor

Having spent a lot of time on this, I am happy to share some good news and some even better news with you.

Before venturing further, I must add that, to limit the scope of this PR, no attempt was made to improve ACKing-mode perf. Although there are some big latency gains seen in ACK mode, these are a side effect of the new messaging design, and work remains to be done to improve ACK mode.

Please see the design docs posted on the STORM-2306 JIRA for details on what is being done.

So, first the good news .. a quick competitive evaluation:

1) Competitive perf evaluation:

Here is a quick comparison of Storm numbers taken on my laptop against numbers for similar/identical topologies published by Heron, Flink and Apex. I shall provide just a rolled-up summary here and leave the detailed analysis for later.

Storm numbers here were run on my MacBook Pro (2015) with 16 GB RAM and a single 4-core Intel i7 chip.

A) Compared To Heron and Flink:


Heron recently published this blog about the big perf improvements (~4-6x) they achieved.
https://blog.twitter.com/engineering/en_us/topics/open-source/2017/optimizing-twitter-heron.html
They ran it on dual 12-core Intel Xeon chips (didn't say how many machines).

They use a simplified word count topology that I have emulated for comparison purposes and included it as part of this PR (SimplifiedWordCountTopo).

Flink also publishes numbers for a similar setup here
https://flink.apache.org/features.html#streaming

Below are per core throughput numbers.

[:HERON:]
Acking Disabled: per core = ~475 k/sec.
Acking Enabled: per core = ~150 k/sec. Latency = 30ms

[:FLINK:]
Per core: ~1 mill/sec

[:STORM:]
Acking Disabled: per core = 2 mill/sec. (1 spout & 1 counter bolt)
Acking Enabled: per core = 0.6 mill/sec, Latency = 0.73 ms (+1 acker)

Takeaways:

B) Compared to Apex:


Apex appears to be the best performer among the open-source lot, by a reasonably good margin. Some numbers they published in their early days (when it was called DataTorrent) were misleading/dubious IMO, but the newer numbers appear credible.

Here we look at how fast inter spout/bolt communication can be achieved using an ultra-minimalist topology.
A ConstSpout emits a short string to a DevNull bolt that discards the tuples it receives. This topo has been in storm-perf for some time now.

Apex provides numbers for an identical setup ... what they call "container local" performance here:
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/

Other than the fact that the Storm numbers were taken on my laptop, these numbers are a good apples-to-apples comparison.

[:APEX:]
Container local Throughput : ~4.2 mill/sec

[:STORM:]
Worker local throughput : 8.1 mill/sec

2) Core messaging Performance

Now for the better news. The redesigned messaging system is actually much faster and able to move messages between threads at an astounding rate:

  • 120 mill/sec (batchSz=1000, 2 producers writing to 1 consumer).
  • 67 mill/sec (batchSz=1000, 1 producer writing to 1 consumer).

I have included JCQueuePerfTest.java in this PR to help get quick measurements from within the IDE.

That naturally raises the question: why is Storm pushing only 8.1 mill/sec between a ConstSpout and a DevNullBolt? The short answer is that there are big bottlenecks in other parts of the code. In this PR I have tackled some such bottlenecks, but many still remain. We are faster than the competition, but still have room to be much, much faster. If anyone is interested in pursuing these to push Storm's perf to the next level, I am happy to point them in the right direction.

Again, please refer to the design docs in the JIRA for details on the new design and the rationale behind it.

Member
@satishd satishd left a comment

Nice work, Roshan!
Had an initial look at the code and left a few comments (mostly minor).

@@ -146,7 +149,7 @@ supervisor.run.worker.as.user: false
#how long supervisor will wait to ensure that a worker process is started
supervisor.worker.start.timeout.secs: 120
#how long between heartbeats until supervisor considers that worker dead and tries to restart it
supervisor.worker.timeout.secs: 30
supervisor.worker.timeout.secs: 30000
Member

Is this really a deliberate change?

Contributor Author

no... i need to revert that.

topology.transfer.batch.size: 10
topology.executor.receive.buffer.size: 50000
topology.producer.batch.size: 1000 # TODO: Roshan: rename
topology.flush.tuple.freq.millis: 5000
Member

nit: Better to add a comment describing this property.

Contributor Author

Was planning to document it somewhere. Do we have a place to document Storm settings ?

Contributor

I'm OK if they're added to Config / DaemonConfig with javadoc on each field. Other configurations are also documented in those classes.

@@ -304,6 +303,7 @@ storm.cgroup.resources:
storm.cgroup.hierarchy.name: "storm"
storm.supervisor.cgroup.rootdir: "storm"
storm.cgroup.cgexec.cmd: "/bin/cgexec"
storm.cgroup.cgexec.cmd: "/bin/cgexec"
Member

may be an accidental copy, needs to be removed.

Contributor Author

thanks!

@@ -30,4 +31,5 @@
void ack(Tuple input);
void fail(Tuple input);
void resetTimeout(Tuple input);
void flush();
Member

May want to add some javadoc about this. It seems we are ready to break the APIs with the new set of changes in this redesign.

Contributor Author

This is not an external-facing API.

Member

It is kind of an external-facing API, as it is given to bolt implementers in the prepare API. IOutputCollector can be wrapped by users to do custom processing before tuples are emitted to target tasks.


(Random passer-by user: E.g. we do this for tests a lot, yes. On the other hand, this is for Storm 2.0, so API breakages are kinda expected. Then again, at least it should be thought through / discussed.)

Contributor Author

I don't think they implement that interface, so it may not be an issue. At most, we may need folks to recompile their topos with Storm 2.0.

Contributor

I think this is effectively a public API, but given that we ship the change in a major version only, I guess that might be fine. It would be better if we had alternatives, though.

Contributor Author

Do we support end users writing custom output collectors? This is public for internal use only, I think.
Right, there was no easy alternative.

LOG.warn("Received invalid messages for unknown tasks. Dropping... ");
private void transferLocalBatch(List<AddressedTuple> tupleBatch) {
try {
for (int i = 0; i < tupleBatch.size(); i++) {
Member

Does foreach have a significant perf issue?

Contributor Author

See 3.8 in the design doc.

Contributor

For non-random-access lists the perf will get worse, so change the signature of transferLocalBatch to accept ArrayList<AddressedTuple>.
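A minimal sketch of the suggested signature change, assuming the rest of the method body stays as in the diff above (only the parameter type and the comments are new):

private void transferLocalBatch(ArrayList<AddressedTuple> tupleBatch) {
    for (int i = 0; i < tupleBatch.size(); i++) {
        AddressedTuple tuple = tupleBatch.get(i);   // guaranteed O(1) random access on ArrayList
        // ... deliver the tuple to its local receive queue, as in the existing body
    }
}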

Contributor Author

yes indeed!

Contributor

Nice observation @roshannaik and @arunmahadevan!

}

public boolean transferLocal(AddressedTuple tuple) throws InterruptedException {
workerData.checkSerialize(serializer, tuple);
Member

I guess this should be done only when it is determined that we are sending to a local task. So this should be pushed to just before queue.publish(tuple) below.

Contributor Author

Good catch. Thanks!
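For illustration, a sketch of the suggested reordering; the queue-lookup step is an assumption borrowed from transferLocalBatch, not the actual method body:

public boolean transferLocal(AddressedTuple tuple) throws InterruptedException {
    JCQueue queue = shortExecutorReceiveQueueMap.get(tuple.dest);
    if (queue == null) {
        return false;                                  // not a local task; skip the serialization check
    }
    workerData.checkSerialize(serializer, tuple);      // only pay this cost for local delivery
    queue.publish(tuple);
    return true;
}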

private ArrayList<List<Integer>> choices;
private AtomicInteger current;
private int current = 0 ;
Member
@satishd satishd Jul 25, 2017

You changed this to be non-thread-safe. Is this instance not shared by multiple executors?

Contributor

chooseTasks is implemented based on the assumption of thread safety, so we should check this again. If it's thread-safe, it is better to leave a comment about this.

}
}

public TupleImpl(GeneralTopologyContext context, List<Object> values, int taskId, String streamId, MessageId id) {
public TupleImpl(GeneralTopologyContext context, List<Object> values, String srcComponent, int taskId, String streamId, MessageId id) {
Member

Why do we need this to be passed in the constructor when it can be derived from the existing arguments, as below? This constructor change is spread across all usages.
srcComponent = context.getComponentId(taskId)

Contributor Author

Because it's a hashmap lookup; this change was based on profiling.
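To illustrate the trade-off (the names below are illustrative, not the actual executor code): the caller looks the component id up once at setup and then reuses it on the hot path, instead of TupleImpl doing a map lookup per tuple.

// once, at executor setup: the HashMap lookup happens here only
String srcComponent = workerTopologyContext.getComponentId(taskId);

// per tuple, on the hot path: no lookup
TupleImpl tuple = new TupleImpl(workerTopologyContext, values, srcComponent, taskId, streamId, msgId);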

// Cache the msgs grouped by destination node
public void add(TaskMessage taskMsg) {
int destId = taskMsg.task();
ArrayList<TaskMessage> msgs = bundles.get(destId);
Member

We can use bundles.computeIfAbsent(destId, integer -> new ArrayList<>()), removing the null check, the list creation, and the put into the bundles map.

Contributor Author

This is in the critical path... that lambda argument will result in allocating a new (lambda) object.


Non-capturing lambdas only have a single instance in a generated private static field, so the allocation cost is only paid once.

Of course, there is some extra cost in the indirection ... then again, computeIfAbsent() is optimized in OpenJDK to only do a single hash + table lookup, whereas if (get() == null) put() does two of each if the get() fails. Overall the difference will be nonexistent / heavily dependent on the prevailing branch.
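For reference, a minimal sketch of the two idioms being discussed, assuming bundles is a HashMap<Integer, ArrayList<TaskMessage>> as in the diff above:

// Idiom 1: explicit null check -- two hash lookups (get + put) when the key is missing
ArrayList<TaskMessage> msgs = bundles.get(destId);
if (msgs == null) {
    msgs = new ArrayList<>();
    bundles.put(destId, msgs);
}
msgs.add(taskMsg);

// Idiom 2: computeIfAbsent -- one lookup; the non-capturing lambda is allocated once per class, not per call
bundles.computeIfAbsent(destId, k -> new ArrayList<>()).add(taskMsg);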

Contributor

2 cents: either is fine to me.

Contributor Author

Thanks for that sweet tip! Just to be sure, I would feel comfortable verifying with the profiler first... given this is in the critical path.

addListRefToMap(this.bundles, entry.getKey(), entry.getValue());

// Cache the msgs grouped by destination node
public void add(TaskMessage taskMsg) {
Member
@satishd satishd Jul 25, 2017

It seems add and send are always invoked on the same thread (from JCQueue.Consumer's accept and flush), and there is no contention. Is that right?

Contributor Author

right.

@avermeer
Contributor

avermeer commented Jul 25, 2017 via email

@satishd
Member

satishd commented Jul 25, 2017

@roshannaik Did you run any Trident topologies in distributed mode with these changes?
You may want to look at the Travis failures; some of them look to be related.

@roshannaik
Contributor Author

roshannaik commented Jul 25, 2017

@avermeer Looks like the SuperChief blog is relaying the same basic claims that Heron has marketed. Since you ask, I will share my opinions w.r.t. Heron's claims.

  • Heron has never been a player in the high-performance club. They have been smart about not comparing themselves with the real top performers of the day. I only included them here because they have made much noise against Storm; otherwise they are not interesting in a perf shootout.
  • They are smart about not mentioning which version of Storm they are comparing with (how does a paper with such critical info missing get accepted?). That creates an illusion that their perf claims apply to all versions of Storm in general... even if Storm publishes new perf numbers comparing itself to a prior version.
  • Heron's threading model (1 thread per process, based on what I gather from their articles) is really primitive for this application domain. I don't recommend it, but by setting 'topology.workers' equal to the number of spout & bolt instances, Storm can be run in Heron mode.
  • I find it much easier to debug a process with multiple components using a debugger rather than starting separate debuggers for every instance of a spout/bolt. Also, I would imagine having so many processes means you have an explosion of log files to deal with when triaging.
  • It is unclear why the recovery model (when a worker process crashes) is any better... the same kind of replay from the spout would be required. The gains may be minor if any. Making minor optimizations to the failure path while penalizing the normal operation path is backwards.
  • Can't get a stack from a Storm worker? That's clearly false. Try it yourself; I do it all the time. Heap dumps, on the other hand, can stall the worker, and if the heap size is really large the supervisor might decide the worker is having a problem. There are timeouts you can increase so that the supervisor waits longer. I can't imagine that Heron doesn't monitor their workers and restart them if they are not responsive.
  • Heron's backpressure model is simply too heavyweight, but is marketed as a novel idea.
  • A quick read of their latest perf blog (noted in the comparison above) made it evident that they missed recognizing their real perf problem.

@HeartSaVioR
Contributor

@roshannaik
First of all, great work! The numbers are impressive.

Before going into this deeply, I would like to see a comparison between the current master branch and this patch (say, before and after) so that we can see the impact of the patch clearly. IMO this kind of comparison is basically required for every performance patch.

@revans2 brought awesome analysis when introducing disruptor batching.
#765 (comment)
#765 (comment)

It would be great if we could see a similar analysis for this patch too, if you don't mind. You can put your detailed analysis there.

@ptgoetz
Member

ptgoetz commented Jul 25, 2017

Let's not be too hard on the Heron community. Yes, in the past they've not been exactly friendly in terms of marketing and technical claims, but now that Heron is incubating as an Apache project (full disclosure: I'm a mentor), there's a chance that that might change. A collaborative relationship with the Heron community has potential benefit, an adversarial one not so much.

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 25, 2017

@avermeer
As @roshannaik stated, we addressed many performance bottlenecks in Storm 1.0.0. The paper about Heron and the blog post on SuperChief referred to an old version of Storm, so their claims don't even hold for 1.0.0.
(Yes, I understand they couldn't pick 1.0.0 to compare against, since 1.0.0 was only released in Apr. 2016.)

I think we learnt a lot and adopted some parts, like back pressure, from Heron's paper (though we implemented it in a different way), but I personally also wonder about their claimed sources of performance gain, especially 1 computational thread per worker. Messages must be serialized even between workers on the same node, and the cost of serialization/deserialization is not that cheap.

Unlike TPC in the SQL world, I think there's no generally accepted way to benchmark streaming, so claiming that A is N times faster than B can hide too much detail. Most frameworks have their own characteristics and hence excel on a specific use case or performance test. Moreover, developers from A don't know B deeply, so most of the time they fail to set the variables optimized specifically for the test.

That's why I much prefer comparing against our own history instead of comparing with others.

@roshannaik
Contributor Author

roshannaik commented Jul 26, 2017

@HeartSaVioR the current master clocks a little over 3 mill/sec for the ConstSpoutNullBolt topo without ACKing. On the latency end, with ACK enabled I saw numbers in multiple seconds with batchSz=1 for master, and as low as ~30 microseconds with the new patch.

More analysis? I thought I might have spent too much time providing measurements of various kinds, both in the design docs and here. :-)

Actually, I have only presented a few pruned-down numbers that I felt were interesting, and for those I was fairly meticulous about recording the numbers. Given the scope of this patch, meticulously recording and cataloging all the perf tests that I have done (both high-level profiler data & low-level hardware perf counters), and performing the many more tests that I could have done, would take forever. I felt the ROI was negative to rigorously pursue some of those measurements any further.

If you feel some other critical measurements are necessary, bring it up.

Batching is a small aspect of this patch. I had done some quick tests to determine a default that appears to be a modest sweet spot for high-throughput, low-latency topos, but not enough to provide a rigorous analysis. As batch size is very dependent on topology components, deployment size and ack mode, IMO the search for a more generally optimal batch size is not important right now.

I think we will have more publish worthy analysis once these changes stabilize and get more testing.

On a related note, about latency and ACK mode: during my recent measurements I noticed an issue in ACK mode, in both the current master and the optimized version, where perf degrades with time (for certain batch sizes)... something I had not noticed on the older version of master on which I had based my changes for a long time. I will share these observations soon.
This needs to be looked into before we have an official release.

Contributor
@arunmahadevan arunmahadevan left a comment

Did a quick pass and left a few comments. Will go through the design doc and do an in-depth review later.

topology.disruptor.wait.timeout.millis: 1000 # TODO: Roshan: not used, but we may/not want this behavior
topology.transfer.buffer.size: 50000
topology.transfer.batch.size: 10
topology.executor.receive.buffer.size: 50000
Contributor

Seems you are reusing this config as the JCQueue size? It's good to introduce a different config, since users might have overridden this value and it may not be an apt value for the JCQueue size.

Contributor Author

The name is still good and means the same thing. Some other defaults are also likely to change in Storm 2.0... I didn't see a good reason to rename it. I have renamed any setting with the name 'disruptor' in it.

Contributor

Agreed that the name looks good. If the value users might have overridden matters, let's document a new starting point for tuning in the release announcement doc or another doc.

Contributor Author

FYI, I do have notes in the javadoc comments as of now. Agreed that we need to call out such settings, for which older/existing values may need review, along with the release announcement doc.

@@ -257,7 +258,7 @@
<!--Note - the version would be inherited-->
<configuration>
<excludes>**/generated/**</excludes>
<maxAllowedViolations>10785</maxAllowedViolations>
<maxAllowedViolations>50785</maxAllowedViolations>
Contributor

There's no point in having a standard if we keep violating it. Please fix the checkstyle issues with the new code.

Contributor Author

The intent was to get a PR out quickly for others to begin reviewing. These fixes can wait for the PR updates; they will surely get fixed.

Contributor

Kind reminder, given that we are getting close.

Contributor Author

yes. fixed internally. will be part of next update.

@@ -137,7 +137,7 @@ public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<I
public static class FieldsGrouper implements CustomStreamGrouping {

private Fields outFields;
private List<Integer> targetTasks;
private ArrayList<List<Integer> > targetTasks;
Contributor

Unless you are using some specific method of the implementation, it's always better to write code against the interfaces.

try {
for (int i = 0; i < tupleBatch.size(); i++) {
AddressedTuple tuple = tupleBatch.get(i);
JCQueue queue = shortExecutorReceiveQueueMap.get(tuple.dest);
Contributor

Since you are optimizing the critical path, you might want to use an ArrayList or array for index-based lookups rather than a Map.
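A sketch of the suggested lookup-table approach, assuming task ids are small non-negative integers (maxTaskId is a hypothetical bound); the array would have to be rebuilt whenever the map is refreshed:

// built (and rebuilt) whenever shortExecutorReceiveQueueMap changes:
JCQueue[] receiveQueueByTask = new JCQueue[maxTaskId + 1];
for (Map.Entry<Integer, JCQueue> e : shortExecutorReceiveQueueMap.entrySet()) {
    receiveQueueByTask[e.getKey()] = e.getValue();
}

// hot path: plain array index instead of a hash lookup
JCQueue queue = receiveQueueByTask[tuple.dest];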

Contributor Author

Yes, it did show up as a bottleneck during profiling. I recall trying to replace it with an ArrayList... and then backing out as it started getting a bit complicated... I think it had something to do with this map getting periodically refreshed. Let me take a look at it again.

LOG.warn("Received invalid messages for unknown tasks. Dropping... ");
private void transferLocalBatch(List<AddressedTuple> tupleBatch) {
try {
for (int i = 0; i < tupleBatch.size(); i++) {
Contributor

For non-random-access lists the perf will get worse, so change the signature of transferLocalBatch to accept ArrayList<AddressedTuple>.

sendSize, waitTimeOutMs, batchSize, batchTimeOutMs);
// Called by flush-tuple-timer thread
public boolean publishFlushTuple() {
TupleImpl tuple = new TupleImpl(workerTopologyContext, new Values(), Constants.SYSTEM_COMPONENT_ID,
Contributor

Should this be constructed newly for each call? If this is a fixed value you can make it a static final value.

Contributor Author

Sure... although this allocation doesn't matter much for perf, as it's not in the critical path. It's invoked once in a while by the timer.

public boolean publishFlushTuple() {
TupleImpl tuple = new TupleImpl(workerTopologyContext, new Values(), Constants.SYSTEM_COMPONENT_ID,
(int) Constants.SYSTEM_TASK_ID, Constants.SYSTEM_FLUSH_STREAM_ID);
AddressedTuple flushTuple = new AddressedTuple(AddressedTuple.BROADCAST_DEST, tuple);
Contributor

Should this be constructed newly for each call? If this is a fixed value you can make it a static final value.

Contributor Author

Right... I don't think it is necessary.

(int) Constants.SYSTEM_TASK_ID, Constants.SYSTEM_FLUSH_STREAM_ID);
AddressedTuple flushTuple = new AddressedTuple(AddressedTuple.BROADCAST_DEST, tuple);
if( receiveQueue.tryPublish(flushTuple) ) {
LOG.debug("Published Flush tuple to: {} ", getComponentId());
Contributor

nit: getComponentId will get evaluated irrespective of the log level

Contributor Author

That's fine... it's not doing any lookups, nor is it in the critical path.

return true;
}
else {
LOG.debug("RecvQ is currently full, will retry later. Unable to publish Flush tuple to : ", getComponentId());
Contributor

nit: getComponentId will get evaluated irrespective of the log level

Contributor

{} is missing in log message.
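The corrected call would look something like this (sketch only), so that the component id actually appears in the message:

LOG.debug("RecvQ is currently full, will retry later. Unable to publish Flush tuple to: {}", getComponentId());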

Contributor Author

will fix

@HeartSaVioR
Contributor

@roshannaik
I didn't mean to put additional load on you. I just would like to see a comparison against Storm itself with the variables controlled, not a competition with papers, to see the impact of this patch clearly.
I'll try to run performance tests (maybe ThroughputVsLatency or something in the perf module) and share the results.

I'll go through the code first.

public void add(Object obj) throws InterruptedException {
boolean inserted = q.publishInternal(obj);
while(!inserted) {
Thread.yield();
Member

It may be better to busy-spin a few times and then yield.
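A minimal sketch of the spin-then-yield idea, assuming the add() method shown in the diff above; the spin limit is an illustrative constant, not a tuned value:

public void add(Object obj) throws InterruptedException {
    final int spinLimit = 100;                 // illustrative; tune based on expected contention
    int attempts = 0;
    boolean inserted = q.publishInternal(obj);
    while (!inserted) {
        if (attempts++ >= spinLimit) {
            Thread.yield();                    // back off only after busy-spinning stops paying off
        }
        inserted = q.publishInternal(obj);
    }
}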

Contributor Author

some other changes coming to that code soon.

Contributor
@HeartSaVioR HeartSaVioR left a comment

I am done with the first pass.
Lots of style violations observed, but I guess we can address them after having some +1s on merging.
I didn't look at the classes regarding the JCQueue implementation since I don't have any knowledge of JCQueue. I'll try to look at it later, but feel OK skipping it given that there are at least two other reviewers reviewing this thoughtfully.

@@ -49,6 +49,9 @@ storm.nimbus.retry.times: 5
storm.nimbus.retry.interval.millis: 2000
storm.nimbus.retry.intervalceiling.millis: 60000
storm.auth.simple-white-list.users: []
storm.auth.simple-acl.users: []
Contributor

The following three lines seem not relevant to this patch.

Contributor Author

Thanks... I think that got introduced somehow during manual reconciling/rebasing to the latest master.

@@ -231,16 +228,13 @@ topology.multilang.serializer: "org.apache.storm.multilang.JsonSerializer"
topology.shellbolt.max.pending: 100
topology.skip.missing.kryo.registrations: false
topology.max.task.parallelism: null
topology.max.spout.pending: null
topology.max.spout.pending: null # TODO: We dont need this any more
Contributor

Looks like you're removing max spout pending and also the configuration for backpressure. Does this patch also touch the backpressure mechanism?

And let's either address this in the patch or file an issue and remove the TODO.

Contributor Author

Yes, back-pressure is addressed; it is described in the design doc. A couple of little things are left related to BP: 1) ensure metrics get updated when retrying to publish to a downstream queue that is full, and 2) remove max.spout.pending.

Contributor

I guess it utilizes a blocking queue, which looks natural for the definition of back pressure. If so, we need to be sure there's no livelock or deadlock. AFAIK, Storm used a blocking queue before and ended up adding an overflow buffer because of deadlock (dining philosophers).

Contributor Author

The critical code path is now lock-free, so there should be no room for live/deadlocks. To be precise, it's not exactly a blocking queue, as there is a retry loop that we can break out of at any time. There are some changes coming soon in this PR w.r.t. what happens when the queue is full; I will also add that info to the design doc.

topology.state.synchronization.timeout.secs: 60
topology.stats.sample.rate: 0.05
topology.stats.sample.rate: 0.001
Contributor

Is this change intended?
It is OK to change it while doing performance tests (especially micro perf), but 0.1% might be too small for production. Users have even complained that only 5% of tuples are measured.
Other frameworks using Dropwizard metrics don't even sample for meters, and I guess we will follow that rule for the new metrics.

Contributor Author

It was definitely impacting perf. There may be some tweaking needed in the sampler code (not sure). Also, sampling 5 out of every 100 seemed a bit excessive even for topos doing several thousand per second. Does 1 out of every 100 seem more reasonable?

Contributor

As I said, some frameworks don't even sample and measure everything for meters. JStorm is one example, and they (Alibaba) said "every day the JStorm cluster handles 1.5 PB / 2 trillion messages" in 2015, which is about 23,000,000 messages per second. Yes, we don't know how many servers they use, but they also publish their performance on the website http://jstorm.io:8080/Performance/ and metrics don't seem to be a considerable bottleneck.

One thing to consider is that the current metrics implementation may be much slower than the Dropwizard one, so we may want to adjust the rate if we believe it is really a bottleneck.

Btw, in most cases users would want operational convenience over performance. We should consider first what users are facing day by day.

Contributor Author

Sure, I agree... that's why I ask. Does 1 out of every 100 seem reasonable? Or do you feel we should continue to sample 5 out of 100? This change is based on the impact observed during perf runs... but if it negatively impacts operational convenience then it's worth reverting.

Contributor

Yes, I don't feel we need to change it if it's only for performance. We already know the sampling rate affects performance, and we tolerate it.
Btw, as I commented, I think we have touched too many variables unrelated to the messaging subsystem and JCQueue. How can you prove that introducing JCQueue brings better throughput and/or lower latency in this patch? How about the messaging subsystem? Let's not touch those variables and address them in a separate issue or patch.

Contributor Author

OK, I personally feel that sampling can be reduced a bit, but if you have a preference towards retaining it... I am OK with that. JCQueue is the core of the messaging system, not separate from it; the numbers demonstrate the improvement.

topology.disruptor.batch.size: 100
topology.disruptor.batch.timeout.millis: 1
topology.disable.loadaware.messaging: false
topology.bolts.outgoing.overflow.buffer.enable: false # TODO: Roshan : Whats this ?
Contributor

In 0.9.x we didn't have an overflow buffer for bolts, and that could cause deadlock (dining philosophers problem), so an overflow buffer was introduced; but that opens the chance of OOM, so it is set to false by default.

After introducing disruptor batching and back pressure, we already made the queue unbounded, so we don't need the configuration any more.

Contributor Author

ok sounds like we can remove this in the PR.

topology.disruptor.wait.timeout.millis: 1000 # TODO: Roshan: not used, but we may/not want this behavior
topology.transfer.buffer.size: 50000
topology.transfer.batch.size: 10
topology.executor.receive.buffer.size: 50000
Contributor

Agreed that the name looks good. If the value users might have overridden matters, let's document a new starting point for tuning in the release announcement doc or another doc.

if (s==null) // then stop running it
break;
if (s>0)
Thread.sleep(s); // TODO: Roshan : need to do something about sleep strategy
Contributor

Use Time.sleepSecs() to be compatible with time simulation. And please check the unit: it was seconds, but the unit is effectively changed to milliseconds in this change.
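A sketch of the suggestion, assuming the interval s is in seconds and the surrounding loop looks like the diff above; Time here refers to org.apache.storm.utils.Time, whose sleeps respect simulated time in tests:

if (s == null)            // then stop running it
    break;
if (s > 0)
    Time.sleepSecs(s);    // simulation-aware sleep, keeping the original unit of seconds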

@@ -291,7 +291,7 @@ private static TupleWindow makeTupleWindow(ArrayList<Tuple>... streams) {
MockContext mockContext = new MockContext(fieldNames);

for (Object[] record : data) {
TupleImpl rec = new TupleImpl(mockContext, Arrays.asList(record), 0, streamName);
TupleImpl rec = new TupleImpl(mockContext, Arrays.asList(record), "testSrc", 0, streamName);
Contributor

Let's pass the component id as a parameter of makeStream if possible.

@@ -313,7 +313,7 @@ private static TupleWindow makeTupleWindow(ArrayList<Tuple>... streams) {

ArrayList<Object> tupleValues = new ArrayList<>(1);
tupleValues.add(recordMap);
TupleImpl tuple = new TupleImpl(mockContext, tupleValues, 0, streamName);
TupleImpl tuple = new TupleImpl(mockContext, tupleValues, "testSrc", 0, streamName);
Contributor

Let's pass the component id as a parameter of makeNestedEventsStream if possible.

@@ -78,7 +78,7 @@ public Fields getComponentOutputFields(String componentId, String streamId) {
}

private Tuple getTuple(String streamId, final Fields fields, Values values) {
return new TupleImpl(getContext(fields), values, 1, streamId) {
return new TupleImpl(getContext(fields), values, "testSrc", 1, streamId) {
Contributor

Let's pass the component id as a parameter of getTuple if possible.

@@ -712,6 +712,6 @@ public static Tuple testTuple(List<Object> values, MkTupleParam param) {
new HashMap<>(),
new HashMap<>(),
new AtomicBoolean(false));
return new TupleImpl(context, values, 1, stream);
return new TupleImpl(context, values, "testSrc", 1, stream);
Contributor

Let's use component instead of "testSrc".

@HeartSaVioR
Contributor

Btw, we should be careful if we drop something we already provide. This patch seems focused on performance optimization (especially micro), and for some spots it might not always be better. One example is disabling load-aware messaging by default.

@revans2
Contributor

revans2 commented Jul 26, 2017

-1

Perhaps I am running into some odd issues here, so if I can be corrected I would be happy to change my vote, but nothing I have run with this patch is better in any way. Are all of the results from micro-benchmarks? Did anyone run a real topology with this patch before posting all of these wonderful results to Twitter? I am not one to swear, but WTF?

I built a stock 2.0.0-SNAPSHOT build (450ed63) and compared it to the exact same release with this patch merged on top of it (which was a clean merge). I am running

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Sierra 10.12.6
MacBook Pro (Retina, 15-inch, Mid 2015)
2.8 GHz Intel Core i7
16 GB 1600 MHz DDR3

I ran the ThroughputVsLatency topology with several different options and no changes at all to the default storm.yaml.

With this patch I found that:

  1. Running a topology with more than one worker appears not to be able to send any messages between workers (or it takes so long that most of the messages time out). So I switched all of my tests to a single worker.
  2. When processing a nearly idle topology (500 sentences/second) the CPU utilization was everything that my box could give it (8 cores fully utilized), compared to about half of one core used by stock Storm.
  3. The latency is absolutely horrible. The minimum latency for a somewhat idle topology was 1 to 4 seconds to do a word count. For a topology processing 10,000 sentences per second it dropped to 800 ms. The maximum latency was 15 seconds for all of these cases. Compare that to stock Storm, which has a min latency of around 3 to 4 ms for the normal case.
  4. The system bolt metrics do not work, or at least I was not able to get any of them back. I tried to compare memory and CPU usage through top, which worked out OK.
  5. Memory usage is insane. The resident memory was 2 GB for almost all of the workers no matter the throughput. That is 1 GB more than stock Storm for the same 10,000 sentences per second.
  6. The maximum throughput is about 1/4th of what it is on stock Storm. I was able to do 100,000 sentences per second on my laptop, both with a single worker and with 4 workers (although the CPU usage was higher in the latter case). With this patch I was able to do 30,000 sentences per second in the best case, but on average it could only do about 25,000.

I am happy to make all of my numbers public. I also plan to run them on a Linux box to see if it is any different.

Contributor
@revans2 revans2 left a comment

Not done with the code review, but until my other concerns are addressed I don't see much reason to do more.

topology.flush.tuple.freq.millis: 5000
topology.spout.recvq.skips: 3 # Check recvQ once every N invocations of Spout's nextTuple() [when ACKs disabled]

topology.disable.loadaware.messaging: true # load aware messaging reduces throughput by ~20%.
Contributor

I agree that turning it off by default may not be what we want. It really helps on heterogeneous clusters. And quite honestly, a lot of the benchmarks that we have been running are nowhere near what I have seen happen in production. It gives you a nice feeling to say we can do X million tuples per second, but these are for bolts that do close to nothing. Most real bolts I have seen actually take some time to process.

@@ -76,6 +76,8 @@ public static StormTopology getTopology(Map<String, Object> conf) {
public static void main(String[] args) throws Exception {
int runTime = -1;
Config topoConf = new Config();
topoConf.put(Config.TOPOLOGY_SPOUT_RECVQ_SKIPS, 8);
Contributor

If we are setting this for all of the perf tests, then why is it not the default? Having a benchmark that does not show what people will get out of the box feels a bit fishy to me, even though everyone does it.

Contributor Author

Higher numbers for that setting are only useful for very low-latency spouts. The default assumes that you are likely to be using Kafka/HDFS-style spouts, which have higher latencies (amount of time spent inside nextTuple()).

*/
@isPowerOf2
public static final String TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE="topology.executor.receive.buffer.size";
Contributor

If this does not need to be a power of 2, that is fine, but we still want an annotation to say what type it is. I think @isPositiveNumber is probably good enough. You might want to make it @isInteger too if you want to limit it to 32 bits.
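Applied to the field above, the suggestion would look roughly like this (assuming Storm's config-validation annotations, as used on other entries in Config):

@isPositiveNumber
@isInteger
public static final String TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE = "topology.executor.receive.buffer.size";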

Contributor Author

indeed! my bad.

@@ -97,6 +97,8 @@ public void run() {
// events.
Time.sleep(1000);
}
if(Thread.interrupted())
Contributor

@HeartSaVioR InterruptedException in most places in storm is only used to shut the process down cleanly. This looks fine to me.

for (LoadAwareCustomStreamGrouping grouper : streamComponentToGrouper.get(stream).values()) {
ArrayList<Integer> outTasks = new ArrayList<>();

// TODO: PERF: expensive hashtable lookup in critical path
Contributor

So, as with all TODOs in the code, we either need to fix the TODO, remove the TODO because we are not going to do it, or remove the TODO and file a follow-on JIRA to do it later.

new WorkerState(conf, context, topologyId, assignmentId, port, workerId, topologyConf, stateStorage,
return loadWorker(topologyConf, stateStorage, stormClusterState, initCreds, initialCredentials);
}
}); // Subject.doAs(...)
Contributor

nit: If we are changing this, can we just make it a lambda instead?

Subject.doAs(subject, () -> loadWorker(topologyConf, stateStorage, stormClusterState, initCreds, initialCredentials));

Contributor

+1 on the nit, because it is simpler and not in the critical path.


private void setupFlushTupleTimer(final List<IRunningExecutor> executors) {
// StormTimer timerTask = workerState.getUserTimer();
Contributor

+1

public void run() {
for (int i = 0; i < executors.size(); i++) {
IRunningExecutor exec = executors.get(i);
if(exec.getExecutorId().get(0) != -1) // dont send to system bolt
Contributor

+1

return transferQueue;
}

public StormTimer getUserTimer() {
return userTimer;
}

final DisruptorQueue transferQueue;
final JCQueue transferQueue; //TODO: Roshan: transferQueue also needs to avoid double batching
Contributor

Is this done? Looks kind of important.

@revans2
Contributor

revans2 commented Jul 26, 2017

I ran again with this exact version (5c0db92) and got the same results.

@revans2
Contributor

revans2 commented Jul 26, 2017

I tried on Linux too and got very similar results. The CPU and memory usage of the topology was lower but the actual throughput and latency of the topology was very similar.

@harshach
Contributor

@revans2 Do you mind posting your storm.yaml, or are you running with defaults? We will try to see if we can reproduce this same behavior on our side. If there are any bugs we will work to fix them, but it shows great potential on the perf improvements.
Regarding posting to Twitter: yes, we are very excited about the patch and definitely want to share the results with the community. Not sure why you are getting upset about it. It's important that we make these perf improvements and also let the community know that there are continuous improvements in Storm. If you found a bug, that's great; that's why we have the PR and review process in place.

@revans2
Contributor

revans2 commented Jul 26, 2017

@harshach I am running with defaults in all cases

I build: mvn clean install -DskipTests
Package: cd storm-dist/binary; mvn clean package
Untar the result: tar -xzvf ./final-package/target/apache-storm-2.0.0-SNAPSHOT.tar.gz
Move the directory so I can save it if I want to switch back and forth: mv apache-storm-2.0.0-SNAPSHOT under-test
Then bring up the cluster:

cd under-test;
./bin/storm dev-zookeeper &
./bin/storm nimbus &
./bin/storm supervisor &
./bin/storm ui &
./bin/storm logviewer &

Wait for everything to come up, and I can see the ui.

Then I run some tests (I pull in the storm starter from the build because the packaged one does not build on its own).

./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency -c topology.workers=1
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency 10000 -c topology.workers=1
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency 10000 -c topology.workers=1 -c topology.max.spout.pending=500
...
./bin/storm jar ../../../examples/storm-starter/target/storm-starter-2.0.0-SNAPSHOT.jar org.apache.storm.starter.ThroughputVsLatency 10000 -c topology.workers=1 -c topology.max.spout.pending=1000

@HeartSaVioR
Contributor

First of all, I think I should be more careful about likes/retweets. Sorry about that; I have just reverted them all.

Regarding tweets, I think it heavily depends on which accounts (who) tweet it.

If users or contributors are tweeting about this PR, we can take it as gossip and there is no problem with it.

If committers or PMC members are doing it, that could be seen as a kind of publicizing, especially since this PR compares with other frameworks and claims Storm will be the no. 1 performer. If PMC members tweet about unsafe or not-yet-stable source, and it turns out otherwise, someone could feel we (the Storm community) are hyping, regardless of intention.
If we say this PR could increase throughput by N times from before, that would be less problematic. I don't say there is no problem, but much less of one.

If we are doing it with the official account (@apachestorm), that really matters, because the account could be seen as speaking on behalf of the Storm community, or at least the PMC members. I was the one wondering why the official account retweeted a non-Storm-related tweet like https://twitter.com/BoredElonMusk/status/889935279213223936 .
If we haven't talked about which things the official account should (not) tweet, we may need to find time to talk about it.

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 26, 2017

What I can suggest now is breaking STORM-2306 down into "redesign messaging subsystem" and "switch to JCTools queues", and trying out the latter part first.
This patch mixes everything up (micro-optimization, switching the queue, messaging architecture change) in one, so it is hard to point out where an issue starts. Let's just have a new pull request that only replaces Disruptor with JCQueue and see if it helps. I'd rather lean toward putting in minimal effort and making it better incrementally.
And let's separate the micro-optimizations (converting map usage to array list, avoiding iterators, etc.) into another issue and not address them in other issues. I think we should see better throughput even without them if the newer subsystem is better and/or JCTools queues are better than Disruptor in our case.

@harshach
Contributor

@HeartSaVioR let's keep this discussion to reviews. This is not a forum to discuss what one should or should not tweet; that's up to individuals. Nobody is trying to promote something that's not feasible; let's not try to be a moral authority here and suggest what one can or cannot do.
Regarding breaking this into multiple PRs addressing different subsystems, that's a reasonable ask. But let's wait before we go down that path; we need to look into the issues raised here and reproduce the case.
I am running a few tests myself and I'll report my findings.

@HeartSaVioR
Contributor

@harshach
As I commented multiple times, this patch is a mix of replacing the queue, changing the messaging subsystem, and micro-optimization. Three committers already had a look in the first pass (excluding @revans2, since he stopped reviewing), but we couldn't identify any critical problems until @revans2 shared his test results. If we go on with this patch, we won't even know which part of the change contributes to the performance gain and by how much.

If we had a PR only replacing the queue with JCTools, it would be easy to review and identify the benefit, or the pros/cons of replacing Disruptor with JCTools. The patch would be fairly simple to review and verify, and most of the issues raised here shouldn't come up there (because I guess most of them are not from the queue replacement).

@harshach
Contributor

@HeartSaVioR I don't mind breaking this into multiple PRs if it helps reviewing and merging. It's up to @roshannaik.

@harshach
Contributor

harshach commented Jul 27, 2017

@revans2 @HeartSaVioR
Here are my findings https://docs.google.com/spreadsheets/d/1wPpC3YXp-vTIelRTUVoLxuxIYxUekiIHUpZ2ZEysC4Y/edit#gid=1644511...

  1. Looking at ThroughputvsLatency I found some issues:

I did the following changes:

  • Disable the HTTP and Logging bolts
  • Disable throttling spout, we want spout to run as fast as it can
  • reduced the executor counts

If you look at lines 51-75:

Apache master clearly couldn't handle the faster spout and starts timing out; perf degrades considerably and very quickly. Whereas STORM-2306 was not only able to handle the faster spout, it delivered stable processing that started out 10x faster and improved to 35x faster compared to master.

  2. Also ran the storm-perf topologies ConstSpoutIdNullBoltIdTopo and ConstSpoutNullBoltTopo. These topologies try to measure the message throughput and latency when there are only 2 components involved, without any external dependencies. Essentially, they test the messaging system.

From lines 3-45 you can see that with this patch we are getting under 10 ms (depending on the topology) compared to an avg of 250 ms+ (with batchSize=1000).

  3. Also ran the storm-examples ThroughputVsLatency with 2 workers. Here there is clearly a bug which is preventing inter-worker communication, so I don't have comparative numbers.

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 27, 2017

@harshach
For ThroughputvsLatency, throttling spout is intended. We set desired throughput and see histogram of latency and other metrics. (CPU, GC, etc.) There're 3 components in topology which parallelism would be set to 4 * worker count so total 12 executor threads per worker. I think we can parameterize the magic number 4 and adjust it while testing too.

I have also done with some performance tests, without modifying TVL topology. The reason is that we should also care about non-performance-maximized topology. For benchmarking performance maximized topology we also have ConstSpoutIdBoltNullBoltTopo, so let's not modify TVL and verify this patch works with all the cases.

Since this patch doesn't seem to handle inter-worker communication properly, the set of tests we can run for now is very limited.

Here's my machine spec used for performance test:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
OS: Ubuntu 17.04
CPU: AMD Ryzen 5 1600 3.2Ghz 6 core (with hyper-thread = 12 logical cores)
RAM: Samsung DDR4 32G 19200
SSD: Samsung 850 Evo

and here's my number (just pasted as raw number):
https://docs.google.com/spreadsheets/d/1J3S4R68CsazlINF60rQn4XCy2Hx5QNzqVoWSO9qo2tc/edit?usp=sharing

My observation is that this patch looks impressive with a performance-maximized topology, but it also looks really bad (not acceptable) with a relatively idle topology.
I've observed all the things @revans2 observed with the TVL tests. One example: with the default parameters of TVL (or a rate of 10000, I don't remember), a worker for this patch showed over 300% (around 340%) CPU usage, whereas a worker for stock Storm showed less than 70%.
But surprisingly this patch looks stable with ConstSpoutIdBoltNullBoltTopo and even CPU usage seems lower than stock in this test.

While we often publicize micro-benchmark results, in practice users run much more idle topologies.
I'm OK if things can be stabilized by adjusting parameters (in that case I think the default values should be set to support such cases), but if not, this should be addressed before accepting the patch. I would be -1 until the TVL results at various rates are stable.

@harshach
Contributor

harshach commented Jul 27, 2017

@HeartSaVioR It's not 12 executors per worker. If you don't pass a command-line argument, the parallelism variable is set to 4 here https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/ThroughputVsLatency.java#L277
and multiplied by 4 again here https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/ThroughputVsLatency.java#L359 , so each component ends up with a parallelism of 16. With 3 components that's 48, plus 2 metrics consumers and 1 acker = 51.
This has nothing to do with how many workers you have.
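
For reference, the arithmetic above spelled out as a sketch (illustrative only; the variable names are not taken from ThroughputVsLatency.java):

int parallelism = 4;                          // default when no command-line argument is passed
int perComponent = 4 * parallelism;           // multiplied by 4 again when building the topology -> 16
int componentExecutors = 3 * perComponent;    // 3 components -> 48
int totalExecutors = componentExecutors + 2 /* metrics consumers */ + 1 /* acker */;   // = 51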

I still need to look into the TVL topology in more detail; before doing so I want to write a real-world topology that reads from Kafka, writes to another Kafka topic, and test that out. The issue here is not with an idle topology but with a topology that throttles the input rate and is not idle. We have two metrics consumers with parallelism set to 1. How confident are we that the metrics consumers are not causing the issue? I'll look into these in more detail and update my findings here.

@HeartSaVioR
Contributor

@harshach The second argument effectively represents the worker count: you can see that the topology sets the worker count to parallelism. I agree that the name is really misleading; even I ran my tests with topology.workers instead of passing the second argument. (Need to run the tests again...)

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 27, 2017

Let me share a quick test result from passing 10000 1 as the TVL parameters:

STORM-2306

uptime:   30 acked:   144,000 acked/sec:   4,800.00 failed:        0 99%:   3,070,230,527 99.9%:   3,221,225,471 min:      63,897,600 max:   3,380,609,023 mean: 1,299,365,069.36 stddev:  685,287,508.16 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:   60 acked:   303,000 acked/sec:  10,100.00 failed:        0 99%:   3,011,510,271 99.9%:   3,200,253,951 min:      28,540,928 max:   3,303,014,399 mean: 1,283,728,691.41 stddev:  671,791,145.42 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:   90 acked:   297,000 acked/sec:   9,900.00 failed:        0 99%:   3,047,161,855 99.9%:   3,307,208,703 min:      62,980,096 max:   3,737,124,863 mean: 1,283,141,447.64 stddev:  675,126,086.16 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  120 acked:   303,000 acked/sec:  10,100.00 failed:        0 99%:   3,047,161,855 99.9%:   3,206,545,407 min:      31,965,184 max:   3,347,054,591 mean: 1,284,140,763.79 stddev:  690,625,730.54 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  150 acked:   299,000 acked/sec:   9,966.67 failed:        0 99%:   3,072,327,679 99.9%:   3,231,711,231 min:      16,703,488 max:   3,414,163,455 mean: 1,320,620,493.23 stddev:  693,327,734.87 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  180 acked:   300,000 acked/sec:  10,000.00 failed:        0 99%:   3,042,967,551 99.9%:   3,248,488,447 min:      48,005,120 max:   3,846,176,767 mean: 1,313,068,274.86 stddev:  671,810,427.83 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  210 acked:   301,000 acked/sec:  10,033.33 failed:        0 99%:   3,061,841,919 99.9%:   3,363,831,807 min:      51,347,456 max:   3,802,136,575 mean: 1,297,807,219.57 stddev:  678,980,965.35 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  240 acked:   301,000 acked/sec:  10,033.33 failed:        0 99%:   3,019,898,879 99.9%:   3,208,642,559 min:      36,962,304 max:   3,363,831,807 mean: 1,315,037,518.24 stddev:  676,620,121.79 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  270 acked:   297,000 acked/sec:   9,900.00 failed:        0 99%:   3,026,190,335 99.9%:   3,200,253,951 min:      52,363,264 max:   3,349,151,743 mean: 1,308,161,023.51 stddev:  680,121,348.29 user:          0 sys:          0 gc:          0 mem:       0.00
uptime:  300 acked:   300,000 acked/sec:  10,000.00 failed:        0 99%:   3,021,996,031 99.9%:   3,200,253,951 min:      49,348,608 max:   3,317,694,463 mean: 1,335,928,012.31 stddev:  667,642,145.32 user:          0 sys:          0 gc:          0 mem:       0.00

CPU usage was around 150-250%, mostly around 160%, which seemed a bit more stable, but it was still fluctuating at the small rate of 10000.

current

uptime:   30 acked:   140,440 acked/sec:   4,681.33 failed:        0 99%:      14,016,511 99.9%:      26,558,463 min:       2,449,408 max:      52,035,583 mean:    7,800,556.68 stddev:    1,790,982.79 user:     28,620 sys:      2,340 gc:          0 mem:      49.94
uptime:   60 acked:   301,860 acked/sec:  10,062.00 failed:        0 99%:      11,141,119 99.9%:      15,351,807 min:       3,233,792 max:      26,181,631 mean:    7,479,081.72 stddev:    1,175,253.40 user:     30,270 sys:      6,800 gc:        190 mem:      54.88
uptime:   90 acked:   301,600 acked/sec:  10,053.33 failed:        0 99%:      10,813,439 99.9%:      13,197,311 min:       3,246,080 max:      16,138,239 mean:    7,375,841.06 stddev:    1,112,541.35 user:     31,660 sys:      7,160 gc:        194 mem:      54.68
uptime:  120 acked:   301,460 acked/sec:  10,048.67 failed:        0 99%:      11,042,815 99.9%:      13,828,095 min:       3,266,560 max:      17,285,119 mean:    7,400,672.94 stddev:    1,130,409.32 user:     29,650 sys:      7,330 gc:        200 mem:      47.80
uptime:  150 acked:   301,500 acked/sec:  10,050.00 failed:        0 99%:      10,911,743 99.9%:      13,246,463 min:       3,248,128 max:      15,654,911 mean:    7,399,041.82 stddev:    1,118,368.85 user:     29,920 sys:      7,370 gc:        199 mem:      41.95
uptime:  180 acked:   301,540 acked/sec:  10,051.33 failed:        0 99%:      10,969,087 99.9%:      13,598,719 min:       3,233,792 max:      16,302,079 mean:    7,390,435.62 stddev:    1,129,976.16 user:     29,840 sys:      7,190 gc:        201 mem:      41.16
uptime:  210 acked:   301,540 acked/sec:  10,051.33 failed:        0 99%:      11,182,079 99.9%:      14,557,183 min:       3,246,080 max:      19,513,343 mean:    7,382,121.55 stddev:    1,161,863.92 user:     29,620 sys:      7,460 gc:        206 mem:     120.40
uptime:  240 acked:   301,560 acked/sec:  10,052.00 failed:        0 99%:      10,903,551 99.9%:      13,303,807 min:       3,262,464 max:      15,187,967 mean:    7,361,940.07 stddev:    1,119,275.14 user:     29,830 sys:      7,150 gc:        205 mem:      63.19
uptime:  270 acked:   301,360 acked/sec:  10,045.33 failed:        0 99%:      10,944,511 99.9%:      13,541,375 min:       3,274,752 max:      16,891,903 mean:    7,349,283.96 stddev:    1,132,254.05 user:     29,530 sys:      7,220 gc:        197 mem:     107.99
uptime:  300 acked:   301,420 acked/sec:  10,047.33 failed:        0 99%:      11,132,927 99.9%:      13,967,359 min:       3,250,176 max:      16,957,439 mean:    7,315,955.89 stddev:    1,158,914.73 user:     29,220 sys:      7,590 gc:        210 mem:      49.34

CPU usage was around 120%, no outstanding fluctuation (+- 10%).

@revans2
Contributor

revans2 commented Nov 6, 2017

@roshannaik any update on this? Is there anything I can do to help get this done for 2.x?

@roshannaik
Contributor Author

roshannaik commented Nov 6, 2017

@revans2 Thanks very much for the offer to help; I think it would be useful for getting past the issue that is blocking this.

Just updated the PR with these two key changes.

  • Discovered that workers need to communicate their internal back pressure situation to other workers, so that they stop/resume sending messages to components experiencing BP.
  • Bolts now also use non-blocking emits, so that under BP they can continue to process metrics ticks (see the sketch below).
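
A minimal sketch of the non-blocking emit idea (illustrative only; apart from JCQueue.tryPublish() and the pendingEmits queue mentioned elsewhere in this PR, the names below are assumptions, not the patch's code):

// If the destination queue is full (back pressure), park the tuple in pendingEmits
// instead of blocking the executor thread, and retry it on a later iteration.
private final Queue<AddressedTuple> pendingEmits = new ArrayDeque<>();   // java.util

void emitNonBlocking(AddressedTuple tuple, JCQueue destQueue) {
    if (!destQueue.tryPublish(tuple)) {   // non-blocking; false => downstream queue is full
        pendingEmits.add(tuple);          // remember the tuple; keep the executor loop (and metrics ticks) running
    }
}

void retryPendingEmits(JCQueue destQueue) {
    while (!pendingEmits.isEmpty() && destQueue.tryPublish(pendingEmits.peek())) {
        pendingEmits.poll();              // successfully drained one parked tuple
    }
}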

Current status:

  • Some new metrics for monitoring overflow need to be introduced. Also, some minor TODOs mentioned in the code need to be addressed.
  • During scale testing, I noticed an inexplicable performance drop under certain multi-worker scenarios. The simplest way to reproduce the issue appears to be a 2-worker setup with spouts and bolts distributed so that messaging occurs in both directions. I don't have a good grip on the issue yet; unfortunately, for the last 3 weeks I had to switch my attention to some other pressing release-centric tasks at work.

May be useful to sync up with you offline soon and see if looking into the issue together helps.

…ication

- Fixed some issues found during multi-worker testing
- Fixes to perf topos to pick up cmd line args correctly
- Removing support for single producer mode in JCQueue
- Almost all msg transfers are non-blocking now (other than metrics and credentials-change notifications). Allows spouts and bolts to process metrics ticks even if the output path is choked due to backpressure
- Some temporary debug logs that need to be removed after testing
- Needs more multiworker mode testing
Roshan Naik added 3 commits November 30, 2017 02:19
- Adding workerID to BpStatus for better debuggability
- Logging the length of an idle stretch for BP  & max.spout.pending wait situations
- Changes to defaults:  topology.executor.receive.buffer.size=32k (rounding up to power of 2),   topology.flush.tuple.freq.millis=1 (same as master)
- minor fixes and improvements
Roshan Naik added 2 commits December 4, 2017 17:12
…it Strategy settings + topology.flush.tuple.freq.millis -> topology.batch.flush.interval.millis )
- Added BackPressure unit tests
@HeartSaVioR HeartSaVioR left a comment

Another partial review done. I finished reviewing up to Task.java, and will continue reviewing as soon as I can find more time.

topology.batch.flush.interval.millis: 1 # Flush tuples are disabled if this is set to 0 or if (topology.producer.batch.size=1 and topology.transfer.batch.size=1).
topology.spout.recvq.skips: 3 # Check recvQ once every N invocations of Spout's nextTuple() [when ACKs disabled]

topology.disable.loadaware.messaging: false # load aware messaging can degrade throughput
Contributor

It may be better to describe the cases where we recommend using load-aware messaging versus disabling it. The current comment may mislead users into thinking it is always better to disable it.

Contributor Author

Yes, that would be good, but I don't have a good enough handle on loadAware's perf characteristics (other than noticing the throughput degradation with the ConstSpout* topos) to make solid recommendations. I had initially disabled it, but since there was (well-founded) opposition from Bobby w.r.t. disabling it, I reverted that and left the short comment.

Bobby mentioned some interesting improvements coming to LoadAware (like scheduling communicating tasks within the same worker instead of distributing them evenly across workers)... so hopefully this situation may change.

Since this file is not conducive to verbose recommendations, perhaps we can put better guidance in Performance.md... if anyone has concrete input on it.

Contributor

Bobby mentioned some interesting improvements coming to LoadAware (like scheduling communicating tasks within the same worker instead of distributing them evenly across workers)... so hopefully this situation may change.

Yes, basic locality awareness is already applied to the load-aware shuffle grouping, and I also expect improvements in determining locality to become available in the future.

Since this file is not conducive to verbose recommendations, perhaps we can put better guidance in Performance.md... if anyone has concrete input on it.

Agreed. Details could be put into Performance.md.
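
For what it's worth, a minimal sketch (not from the patch) of how a topology could toggle the option while benchmarking; the config key is the one shown in the defaults snippet above, and the class/method names here are purely illustrative:

import org.apache.storm.Config;

public class LoadAwareToggleExample {
    // Returns a per-topology Config that disables (or re-enables) load-aware messaging.
    public static Config benchmarkConf(boolean disableLoadAware) {
        Config conf = new Config();
        conf.put("topology.disable.loadaware.messaging", disableLoadAware);
        return conf;
    }
}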



## Max.spout.pending
The back pressure mechanism no longer requires `topology.max.spout.pending`. It is recommend to set this to null (default).
Contributor

According to the report posted in the issue, setting this option still yields better results, and hence the tests in the report use it. Do we want to document that as well, or is it still better to recommend that users set this to null?

Contributor Author

Thanks for catching that. This comment is no longer true. Will fix.

@Override
public void run() {
long start = System.currentTimeMillis();
// while (!Thread.interrupted()) {
Contributor

Is this expected to be toggled by uncommenting it (and commenting out while (!halt) {)? If not, let's just remove this line.

Contributor Author

will fix

@Override
public void run() {
long start = System.currentTimeMillis();
// while (!Thread.interrupted()) {
Contributor

Same here.

Contributor Author

fixed.

public void run() {
Handler handler = new Handler();
long start = System.currentTimeMillis();
// while(!Thread.interrupted()) {
Contributor

Same here, and for the commented-out lines below.

Contributor Author

fixed.

if (grouper == GrouperFactory.DIRECT) {
throw new IllegalArgumentException("Cannot do regular emit to direct stream");
}
List<Integer> compTasks = grouper.chooseTasks(taskId, values, loadMapping);
outTasks.addAll(compTasks);
outTasks.addAll(compTasks); // TODO: PERF: this is a perf hit
Contributor

Same here: it may be better to decide one way or the other.

IMHO I'm still not convinced that this can introduce a performance hit. The costs I can imagine are allocating the backing array (only once per method call) and expanding the array, but unless we use fan-out in a huge topology, outTasks is expected to be small.

@@ -177,6 +196,35 @@ public BuiltinMetrics getBuiltInMetrics() {
return builtInMetrics;
}


// Non Blocking call. If cannot emmit to destination immediately, such tuples will be added to `pendingEmits` argument
Contributor

nit: emmit -> emit

Contributor Author

fixed.

/**
* Send sampled data to the eventlogger if the global or component level debug flag is set (via nimbus api).
*/
public void sendToEventLogger(Executor executor, List values,
Contributor

Will this method be called on every emit when the number of event-logger executors is more than 0? If so, we may want to apply more optimizations here (not in this issue but in a follow-up issue; others can take it up).

Contributor Author

Agreed. Since topology.eventlogger.executors=0 (disabled) by default, I didn't spend time on this.

@@ -192,7 +240,7 @@ private TopologyContext mkTopologyContext(StormTopology topology) throws IOExcep
ConfigUtils.supervisorStormDistRoot(conf, workerData.getTopologyId())),
ConfigUtils.workerPidsRoot(conf, workerData.getWorkerId()),
taskId,
workerData.getPort(), workerData.getTaskIds(),
workerData.getPort(), workerData.getLocalTaskIds(),
Contributor

The renamed method is clearer about the meaning. Nice improvement.

@@ -186,9 +201,11 @@ public Runnable getSuicideCallback() {
final AtomicBoolean isTopologyActive;
final AtomicReference<Map<String, DebugOptions>> stormComponentToDebug;

// executors and taskIds running in this worker
// executors and localTaskIds running in this worker
final Set<List<Long>> executors;
Contributor

Based on the comment, this could also be named localExecutors.

@HeartSaVioR HeartSaVioR left a comment

Still in the process of reviewing... Will continue from SpoutOutputCollectorImpl.java.

itr.remove();
changed = true;
} else {
// LOG.info("Task = {}, OverflowCount = {}, Q = {}", entry.getKey(), entry.getValue().getOverflowCount(), entry.getValue().getQueuedCount() );
Contributor

Would we want to leave it with DEBUG, or remove the comment?

Contributor Author

fixed.

new WorkerState(conf, context, topologyId, assignmentId, port, workerId, topologyConf, stateStorage,
return loadWorker(topologyConf, stateStorage, stormClusterState, initCreds, initialCredentials);
}
}); // Subject.doAs(...)
Contributor

+1 on the nit, because it is simpler and not on the critical path.

@@ -193,6 +210,24 @@ public void run() {
});
}

/**
* Schedule a function to run recurrently
* @param delayMs the number of seconds to delay before running the function
Contributor

the unit is not seconds but milliseconds.

Contributor Author

indeed.

private void setupFlushTupleTimer(final Map<String, Object> topologyConf, final List<IRunningExecutor> executors) {
final Integer producerBatchSize = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_PRODUCER_BATCH_SIZE));
final Integer xferBatchSize = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_TRANSFER_BATCH_SIZE));
final Long flushIntervalMicros = ObjectReader.getLong(topologyConf.get(Config.TOPOLOGY_BATCH_FLUSH_INTERVAL_MILLIS));
Contributor

The config name is in milliseconds, whereas the variable name says microseconds. If you want to convert it later, it would be better to do so here, or just rename it to milliseconds.

It looks like it is used as milliseconds below, so I guess the variable name and the log messages below in this method are wrong.


private void setupBackPressureCheckTimer(final Map<String, Object> topologyConf) {
final Integer workerCount = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_WORKERS));
if (workerCount == 0) {
Contributor

Don't we need to change it to if (workerCount <= 1) to cover workerCount == 1?

Contributor Author

Yes, that was the intent.
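
Presumably the intended guard looks something like the following (only a sketch; the body of the check is not shown in the diff, and the early return assumes the enclosing setup method is void):

final Integer workerCount = ObjectReader.getInt(topologyConf.get(Config.TOPOLOGY_WORKERS));
if (workerCount <= 1) {
    return;   // single-worker topology: no remote workers to notify, so skip the back-pressure check timer
}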

if (consumeIdleCounter==0) {
LOG.debug("Invoking consume wait strategy");
}
consumeIdleCounter = consumeWaitStrategy.idle(consumeIdleCounter);
Contributor

Could we let wait strategy instances handle relevant counters (consumeIdleCounter, bpIdleCount)?

Contributor Author

I think it is possible, but the counters have to be reset, and checked for == 0, from outside... so there didn't seem to be much benefit in abstracting them inside the strategy.

Contributor

OK. Let's revisit this if it gets confusing again.

boltFailInfo.applyOn(taskData.getUserContext());
if (delta >= 0) {
boltFailInfo.applyOn(task.getUserContext());
if (delta != 0) {
Contributor

Maybe you only intended to address the occurrence above; this one should stay the same.

(when (<= 0 delta)
(stats/bolt-failed-tuple! executor-stats
(.getSourceComponent tuple)
(.getSourceStreamId tuple)
delta))))

final boolean isActive = stormActive.get();
public Long call() throws Exception {
int receiveCount = 0;
if (i++ == recvqCheckSkipCount) {
Contributor

minor: we could use a longer but better variable name than i, since it's not a short-lived temporary variable.

Contributor Author

yes makes sense.

rmspCount = 0;
}

if ( receiveCount>1 ) {
Contributor

Is it intentional to compare this with 1? It would seem more natural to compare with 0, but you may have a reason for doing it this way.

Contributor Author

Good observation. It actually used to be 0. Then I noticed it was a fairly common case for there to be just one item in the recvQ (a metrics tick, flush tuple, etc.) during idle stretches. Once we consume that, we idle immediately rather than checking again; a minor optimization.
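
An illustrative sketch of that decision (the consume call and its return value are assumptions, not the patch's code; the idle() call follows the wait-strategy pattern shown earlier in this review):

int receiveCount = receiveQ.consume(handler);     // hypothetical: number of items consumed in this pass
if (receiveCount > 1) {
    consumeIdleCounter = 0;                       // real traffic arrived: reset the idle/backoff counter
} else {
    // 0 or 1 items (typically just a metrics tick or flush tuple): idle right away instead of re-polling
    consumeIdleCounter = consumeWaitStrategy.idle(consumeIdleCounter);
}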

if (!transferQueue.isFull() && !throttleOn && !reachedMaxSpoutPending) {
for (ISpout spout : spouts) {
spout.nextTuple();
boolean pendingEmitsIsEmpty = tryFlushPendingEmits();
Contributor

minor: The logic for when the spout is active is now somewhat long, with lots of nested ifs and returns. It may be better to refactor it a bit.

Contributor Author

Yes, I tried that once and rolled it back due to the two levels of callables. Will try again.

@HeartSaVioR HeartSaVioR left a comment

Forgot one comment:
I have seen this kind of line, idToTask.get(taskId - idToTaskBase), so many times. Given that it is not easy to understand unless we indicate how idToTask is built, it would be really nice if we could find a way to abstract it.
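
One possible abstraction (purely illustrative, not part of the patch): hide the base-offset arithmetic behind a small accessor on the owning class.

private Task getTaskById(int taskId) {
    return idToTask.get(taskId - idToTaskBase);   // idToTask is an ArrayList indexed from idToTaskBase
}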

@HeartSaVioR HeartSaVioR left a comment

Just finished reviewing, except for JCQueue and the related tests. I'll see if I can review them afterwards.

writeMessage(channel, batch);
}
} catch (IOException e) {
dropMessages(msgs);
Contributor

Is !channel.isConnected() the only cause of an IOException here? If so, this doesn't swallow the exception quietly, but it will if there are other possible causes of IOException.

Contributor Author

writeMessage() is the only call that throws that IOException (after checking !channel.isConnected()). I think I should add a log message as well.
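
Something along these lines, perhaps (only a sketch of the catch block above; msgs, channel, dropMessages, and LOG are the names visible in the surrounding class):

} catch (IOException e) {
    LOG.warn("Dropping messages because channel {} is not connected", channel, e);   // don't swallow silently
    dropMessages(msgs);
}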

@@ -39,7 +38,7 @@ public KryoTupleDeserializer(final Map<String, Object> conf, final GeneralTopolo
_kryoInput = new Input(1);
}

public Tuple deserialize(byte[] ser) {
public TupleImpl deserialize(byte[] ser) {
Contributor

Just a reminder: Let's add @Override, and just revert back if we don't leverage the explicit implementation type.

@@ -30,4 +31,5 @@
void ack(Tuple input);
void fail(Tuple input);
void resetTimeout(Tuple input);
void flush();
Contributor

I think this is effectively a public API, but given that we're shipping the change only in a major version, that is probably fine. It would be better if we had alternatives, though.


@Override
public void flush() {
//NOOP //TODO: Roshan: validate if this is OK
Contributor

Please remove TODO if it is addressed, or address this before merging.

Contributor Author

Not sure yet; I wanted others' opinions. Would you know if it's OK for this to be a no-op?

Contributor

We are adding this in this patch, so I'm also not sure about that.

public TupleImpl(GeneralTopologyContext context, List<Object> values, int taskId, String streamId) {
this(context, values, taskId, streamId, MessageId.makeUnanchored());

//// Perf Note: Disabling these checks as they are a throughput bottleneck
Contributor

Friendly reminder: if we are OK with dropping schema validation, let's just remove the code. If we still want to validate the schema, let's leave it as it is.
(Maybe we could pass the schema in as well, for the same reason as srcComponent.)

if (s==null) // then stop running it
break;
if (s>0)
Thread.sleep(s);
Contributor

Please use Time.sleepSecs() for simulated time.
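
That would make the loop body above look roughly like this (a sketch; it assumes s is in seconds as the suggestion implies -- if it is actually milliseconds, Time.sleep(s) would be the equivalent):

if (s == null) {        // then stop running it
    break;
}
if (s > 0) {
    Time.sleepSecs(s);  // org.apache.storm.utils.Time honors simulated time in tests, unlike Thread.sleep()
}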

@@ -1505,4 +1507,20 @@ public static StormTopology addVersions(StormTopology topology) {
}
return defaultsConf;
}

public static <V> ArrayList<V> convertToArray(Map<Integer, V> srcMap, int start) {
Set<Integer> executorIds = srcMap.keySet();
Contributor

Let's use a more generic variable name: ids would be enough.

Contributor Author

indeed.
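
For context, the helper presumably does something like the following (a sketch reconstructed from its signature and from the idToTask.get(taskId - idToTaskBase) usage mentioned later in this review, not the literal patch code; java.util imports assumed):

public static <V> ArrayList<V> convertToArray(Map<Integer, V> srcMap, int start) {
    int largestId = srcMap.keySet().stream().max(Integer::compareTo).get();
    ArrayList<V> result = new ArrayList<>(largestId - start + 1);
    for (int i = 0; i <= largestId - start; i++) {
        result.add(null);                                        // pre-fill with nulls so set() can index freely
    }
    for (Map.Entry<Integer, V> entry : srcMap.entrySet()) {
        result.set(entry.getKey() - start, entry.getValue());    // place each value at index (id - start)
    }
    return result;
}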

@@ -79,6 +79,7 @@
protected final long lowMemoryThresholdMB;
protected final long mediumMemoryThresholdMb;
protected final long mediumMemoryGracePeriodMs;
private static int port = 5006; // TODO: Roshan: remove this after stabilization
Contributor

Friendly reminder: this should be addressed before merging.

Contributor Author

Yes, will do this last; it's needed for debugging.

@@ -747,6 +748,16 @@ protected String javaCmd(String cmd) {
commandList.add("-Dstorm.conf.file=" + topoConfFile);
commandList.add("-Dstorm.options=" + stormOptions);
commandList.add("-Djava.io.tmpdir=" + workerTmpDir);
if (_topoConf.get("attach.debugger") != null) { // TODO: Roshan: remove this after stabilization
Contributor

Friendly reminder: this should be addressed before merging.

@HeartSaVioR HeartSaVioR left a comment

Also reviewed the JCQueue-related files: I've finished another pass of reviewing.

}
} // class DirectInserter

private static class BatchInserter implements Inserter {
Contributor

minor: it may be better to leave a comment noting that the class is not thread-safe. Even better would be to also add a comment on the ThreadLocal variable.

Contributor Author

good point.

if (producerBatchSz > 1) {
inserter = thdLocalBatcher.get();
if (inserter == null) {
BatchInserter b = new BatchInserter(this, producerBatchSz);
Contributor

Just curious: is there a specific reason to have the temporary local variable b instead of directly using the local inserter variable?

Contributor Author

To avoid casting: thdLocalBatcher.set(b) expects a BatchInserter, and inserter is of the base type.

}


public interface ExitCondition {
Contributor

A Predicate with properly named variables might let us get rid of this interface.

Contributor Author

Yes, I had it that way originally, but then felt that the call to keepRunning() made things more readable (it makes it clear that if it returns true we keep running, as opposed to quitting). I can go back to a Predicate-style interface if you feel this is not desirable.

Contributor

OK makes sense. Let's keep it as it is.
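
For readers following along, a sketch of the trade-off being discussed (the interface shape is assumed; only the ExitCondition name and keepRunning() come from the diff):

public interface ExitCondition {
    boolean keepRunning();                 // named for intent: true => keep consuming
}

// The consumer loop then reads as intent:
//     while (exitCondition.keepRunning()) { ... consume from the queue ... }
// The java.util.function alternative would be a BooleanSupplier with a well-named variable:
//     while (shouldKeepRunning.getAsBoolean()) { ... }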

for (int i = 0; i < producerNum; i++) {
producerThreads[i].join(TIMEOUT);
assertFalse("producer "+i+" is still alive", producerThreads[i].isAlive());
}
queue.haltWithInterrupt();
consumerThread.join(TIMEOUT);
assertFalse("consumer is still alive", consumerThread.isAlive());
//TODO need to fix this... assertFalse("consumer is still alive", consumerThread.isAlive());
Contributor

Is it fixed, or does it still need fixing?

Contributor Author

Yes, it's fixed. Will remove the comment.

}


// check that tryPublish() & tryOverflowPublish() work as expected
Contributor

Did you miss adding @Test here, or is that intended?

Contributor Author

thanks. will fix.

Roshan Naik added 2 commits December 28, 2017 01:15
@roshannaik roshannaik left a comment

  • Thanks @HeartSaVioR. Addressed most of your comments; will get back to you on the rest.

for (LoadAwareCustomStreamGrouping grouper : streamComponentToGrouper.get(stream).values()) {
ArrayList<Integer> outTasks = new ArrayList<>();

// TODO: PERF: expensive hashtable lookup in critical path
Contributor Author

It may be possible to fix, but it needs some work... one way is to see whether streamIds can be represented internally as ints.
Filed an issue.

if (grouper == GrouperFactory.DIRECT) {
throw new IllegalArgumentException("Cannot do regular emit to direct stream");
}
List<Integer> compTasks = grouper.chooseTasks(taskId, values, loadMapping);
outTasks.addAll(compTasks);
outTasks.addAll(compTasks); // TODO: PERF: this is a perf hit
Contributor Author

I understand your position, as I was surprised too. I have filed an issue and explained in it how I originally discovered this. I didn't pursue it much, so I don't know why it is a perf hit (or whether it still is), but I felt it was a good idea to record the observation for later.

@@ -177,6 +196,35 @@ public BuiltinMetrics getBuiltInMetrics() {
return builtInMetrics;
}


// Non Blocking call. If cannot emmit to destination immediately, such tuples will be added to `pendingEmits` argument
Contributor Author

fixed.

/**
* Send sampled data to the eventlogger if the global or component level debug flag is set (via nimbus api).
*/
public void sendToEventLogger(Executor executor, List values,
Contributor Author

Agreed. Since topology.eventlogger.executors=0 (disabled) by default, I didn't spend time on this.

itr.remove();
changed = true;
} else {
// LOG.info("Task = {}, OverflowCount = {}, Q = {}", entry.getKey(), entry.getValue().getOverflowCount(), entry.getValue().getQueuedCount() );
Contributor Author

fixed.


@Override
public void flush() {
//NOOP //TODO: Roshan: validate if this is OK
Contributor Author

Not sure yet; I wanted others' opinions. Would you know if it's OK for this to be a no-op?

@Override
public void run() {
long start = System.currentTimeMillis();
// while (!Thread.interrupted()) {
Contributor Author

fixed.

public void run() {
Handler handler = new Handler();
long start = System.currentTimeMillis();
// while(!Thread.interrupted()) {
Contributor Author

fixed.

if (workerData.tryTransferRemote(addressedTuple, pendingEmits)) {
++remotesBatchSz;
if (remotesBatchSz >= producerBatchSz) {
workerData.tryFlushRemotes();
Contributor Author

Need to take a more careful look; will get back on that.

try {
Task task = new Task(executor, taskId);
executor.sendUnanchored(
task, StormCommon.SYSTEM_STREAM_ID, new Values("startup"), executor.getExecutorTransfer());
task.sendUnanchored( StormCommon.SYSTEM_STREAM_ID, new Values("startup"), executor.getExecutorTransfer(), null); // TODO: Roshan: does this get delivered/handled anywhere ?
Contributor Author

@HeartSaVioR, @revans2, others: would you know why we have this call to sendUnanchored? I recall verifying on master (and on 2306) that it doesn't get delivered anywhere. We could remove it if it's not needed.

@revans2
Contributor

revans2 commented Jan 4, 2018

Sorry it has taken me so long to get back to this. I did a quick pass through the code, but I need to spend some time to go more in depth. I expect to be doing this this afternoon and tomorrow.

My biggest concern right now is around up-merging and getting the unit tests to pass. I don't see a reason to have JCQueueTest fail 100% of the time.

I am also concerned that a lot of other unit tests fail 100% of the time with this patch too.

@roshannaik
Contributor Author

This PR page has become unusable. Most of the time it doesn't load, and when it does load it is too slow to use... even for just typing a comment. Let's stop using this one and switch to the new PR (thanks for the suggestion @HeartSaVioR). There are a few recent unaddressed comments on this PR that I will track and address in the new PR.

@roshannaik roshannaik closed this Jan 5, 2018