Skip to content

Conversation

@d2r
Copy link

@d2r d2r commented May 16, 2015

No description provided.

@d2r
Copy link
Author

d2r commented May 16, 2015

This pull request requires some more scrutiny because of its size and to make sure it does not break functionality.

Goals for this PR

  • New Nimbus Thrift calls:
    • getTopologyPageInfo
    • getComponentPageInfo
  • Nimbus Thrift call getTopologyInfo remains
  • Aggregates metrics on nimbus so that serialized data is much smaller for large, highly-connected topologies
  • Moves ui code to stats namespace
  • Iterates over heartbeats one time.
  • Thrift deserialization code avoids deserializing to symbols, preferring strings, since making a symbol out of a string like "600" results in a symbol 600 and not an integer 600. This can be very confusing to debug, and breaks key comparisons in places.
  • Removes some dead code and does a little refactoring in places.

EDIT to add:

  • Because data is aggregated on nimbus side, visualization is only initialized now when the visualization is activated.

@d2r d2r force-pushed the storm-820-agg-stats-on-nimbus branch from 3d3c2b7 to a16b50c Compare May 21, 2015 16:24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have the name log somewhere in the function name. perhaps log-pprint

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rename it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not pull the value out in the for's binding vector?

(for [[k v] idk->exec-avg]
  [k {:executeLatencyTotal (weight-avg k v)
    :processLatencyTotal (weight-avg k v)
    :executed v}]))))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. Nevermind. Totally missed that the maps were different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather see "600" def'd somewhere.

@d2r
Copy link
Author

d2r commented May 22, 2015

Review comments should be addressed.

@d2r
Copy link
Author

d2r commented May 22, 2015

Addressed review comments

@knusbaum
Copy link
Contributor

+1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could result in a null dispatch value if the component's heartbeat does not contain metrics. We should be using the :type directly instead of accessing it via :stats.

@d2r
Copy link
Author

d2r commented May 28, 2015

Debugging a corner case we found today...

@d2r d2r closed this May 28, 2015
@d2r
Copy link
Author

d2r commented May 28, 2015

Handled some corner cases when the worker process crashes such that executor heartbeats have no metrics data.

@d2r d2r reopened this May 28, 2015
@d2r
Copy link
Author

d2r commented Sep 4, 2015

Leaving a comment here for now: the key here should be :complete-latency or else each spout's complete latency will show 0.000 on the topology page. (Component page is not affected.)

@d2r
Copy link
Author

d2r commented Sep 24, 2015

This needs to be upmerged with the nimbus-ha changes

@d2r d2r closed this Sep 24, 2015
@d2r d2r reopened this Sep 29, 2015
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix, only a time stamp change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a "\n" before logging the message?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was to clean up the output. Otherwise the data structure would not begin on its own line but instead begin after the timestamp, etc., and so that was not "pretty."

@revans2
Copy link
Contributor

revans2 commented Sep 29, 2015

For the most part this looks good. My only comment would be to update the REST API docs to describe what the new APIs are like. And did you change time to errorTime in the JSON response?

@d2r
Copy link
Author

d2r commented Sep 29, 2015

And did you change time to errorTime in the JSON response?

Good catch. Updated API doc with a note on the name change.

@revans2
Copy link
Contributor

revans2 commented Sep 29, 2015

Thanks for the update I am +1 on the change now. The aggregation code is a bit complex, but I cannot think of any way to make it less complex right now.

@revans2
Copy link
Contributor

revans2 commented Oct 3, 2015

@d2r please upmerge. The code still looks great but it looks like upSecs and a debug were added into the results. I really want to get this in, as it will speed up page loads a lot.

@d2r
Copy link
Author

d2r commented Oct 5, 2015

ok working on it

@d2r
Copy link
Author

d2r commented Oct 5, 2015

OK, I upmerged, and then had to upmerge again right after that. I double-checked that uptimeSeconds appear in the places they should, and that I was able to change the log level for a worker running ExclamationTopology.

@revans2
Copy link
Contributor

revans2 commented Oct 5, 2015

Still +1

@asfgit asfgit merged commit 2d59aed into apache:master Oct 5, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants