Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement BrokerHostUsage using java #88

Merged
merged 1 commit into from
Nov 16, 2016

Conversation

sschepens
Copy link
Contributor

Motivation

Issue #81
I've made a first go a this, I believe it's working ok, please review 😊

Modifications

I've removed support for usage script and made BrokerHostUsage calculate metrics from system data.
Current implementation only supports Linux, we should probably find a way to implement the same functionallity for other OS.

Result

Broker can collect usage metrics without need for a script.

@yahoocla
Copy link

CLA is valid!

@sschepens sschepens force-pushed the host_usage_collection branch 3 times, most recently from 0c564b7 to 3533040 Compare October 26, 2016 20:01
Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good, added few comments

loadBalancerHostUsageScriptPath=

# Frequency of sar report to collect
# Frequency of report to collect
loadBalancerHostUsageCheckIntervalMinutes=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could change this setting to specify the frequency of the stats collections, and since we don't rely on sar anymore, we could specify intervals of < 1min

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really make sense to have intervals of < 1min? Would we be updating these reports to ZK everytime?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not super-sure it would be useful. Anyway this would not necessarily be controlling the frequency of update in ZK, just the frequency of collection, and the time window on which to compute the average rates.

Even now, updating ZK is skipped if there are no significant changes in the host load.

@@ -32,15 +41,26 @@
public class BrokerHostUsage {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it would be worth to have a base interface and instantiate the right implementation at startup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What interface do we plan on having? Just exposing SystemResourceUsage getBrokerHostUsage() would be enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Even some of the OS-indipendent methods (eg: used mem, processors count) could be put in a base abstract class.

usage.setCpu(new ResourceUsage(0d, totalCpuLimit));
} else {
double elapsedSeconds = (now - lastCollection) / 1000d;
Double nicUsageTx = (totalNicUsageTx - lastTotalNicUsageTx) / elapsedSeconds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use double and long as much as possible, since every Double and Long will trigger an allocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I missed these

if (exitCode != 0) {
LOG.warn("Process exited with non-zero exit code - [{}], stderr - [{}] ", exitCode, writer.toString());
throw new IOException(writer.toString());
String[] words = Files.lines(Paths.get("/proc/stat"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit terse, we should include a comment with an example of the file line. 😄
The computation looks correct though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


return new CpuStat(values.sum(), values.limit(3).sum());
} catch (IOException e) {
return new CpuStat(0L, 0L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not blindly ignore exception, otherwise it would be very difficult to discover any problem.

.map(path -> path.getFileName().toString())
.collect(Collectors.toList());
} catch (IOException e) {
throw new RuntimeException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, we should either propagate the IOException or at least print a warning log message if we can't read the information.

} catch (IOException e) {
return 0d;
}
}).sum() * 1024 / 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should report kbits/s for NIC speed and traffic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok!

@@ -71,25 +91,152 @@ public BrokerHostUsage(PulsarService pulsar) {
*
* @throws IOException
*/
public String getBrokerHostUsage() throws IOException {
StringWriter writer = new StringWriter();
public synchronized SystemResourceUsage getBrokerHostUsage() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refresh should not be done every time this method is called, but rather at regular intervals (eg: every 1min)

Copy link
Contributor Author

@sschepens sschepens Oct 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If calculations are done every 1min, but the data es retreived on a longer period, does it really make sense to make calculations at shorter periods? Would we always return the average usage of the last minute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation runs the script everytime this method is called.
How often is this method actually called?
This method would return the average usage between two calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be consistent on what data we report. Eg: all the averages should be computed on the same time window, across all the brokers, to make a meaningful comparison.

Now, if a broker see that its own load has not changed from what it has already reported on ZK (+- x%), it will not update ZK (for a certain time), but when it does, it should report the same avg over same time window.

From an API perspective, the semantic of BrokerHostUsage should not change if we call the same method twice in short time.

In current case it's maybe true that the getBrokerHostUsage() is called exactly every 1min, but sometimes happens that some new component that need the same information will start calling the same method... and that will start giving fun results.. (we've seen some of these things :) )

Because of that, I was thinking to use an executor and schedule the task to update the rates and the getBrokerHostUsage() will just return the latest computed rates.

@@ -4,4 +4,4 @@
# Since the original script is system dependent and also requires /home/y location, this script is the dummy one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this script as well.

@merlimat merlimat added this to the 1.16 milestone Oct 27, 2016
@merlimat merlimat added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label Oct 27, 2016
@sschepens sschepens force-pushed the host_usage_collection branch 2 times, most recently from 1116970 to 661c481 Compare October 31, 2016 21:14
@sschepens
Copy link
Contributor Author

@merlimat do these last changes seem ok? I've reused loadManagerExecutor to schedule usage calculations, and addressed most of the comments I believe.

}

private class CpuStat {
private Long totalTime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use 2 long here

@@ -177,7 +177,7 @@ public SimpleLoadManagerImpl(PulsarService pulsar) {
placementStrategy = new WRRPlacementStrategy();
lastLoadReport = new LoadReport(pulsar.getWebServiceAddress(), pulsar.getWebServiceAddressTls(),
pulsar.getBrokerServiceUrl(), pulsar.getBrokerServiceUrlTls());
brokerHostUsage = new BrokerHostUsage(pulsar);
brokerHostUsage = new LinuxBrokerHostUsageImpl(pulsar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will generate errors when running on MacOS. We should have a basic factory that returns a generic host usage implementation when not running on Linux.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we don't need precise load information for MacOS, since it's not meant to be used for prod.. just not throwing exception or printing error logs, at least for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative would be to also use OperatingSystemMXBean for cpu usage collection, it exposes getSystemCpuLoad() which the recent cpu usage, however, the period considered is not documented.
We could probe this method once every 1 or 2 seconds probably and take the average.
This would make load calculation more portable.

Nic usage however, would remain linux-specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use the mbean based CPU usage for the "generic os" implementation and keep using /proc/cpustat on Linux.

For mac, we could omit he NIC traffic until we find a reasonable way to retrieve that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use the mbean based CPU usage for the "generic os" implementation and keep using /proc/cpustat on Linux.

For mac, we could omit he NIC traffic until we find a reasonable way to retrieve that.

return new CpuStat(total, total - idle);
} catch (IOException e) {
LOG.error("Failed to read CPU usage from /proc/stat", e);
return new CpuStat(0L, 0L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of usage % would we get when we fallback to (0, 0) ?
Would it look like 0% or 100% ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd get "divide by 0 error" at line 88, since the CPU time diff would be 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, i'll return null instead, as there's no other easy way around this, returning the same values will always end up in dividing by zero.

@sschepens
Copy link
Contributor Author

@merlimat pushed a generic implementation, could you review this again please?

this.lastCollection = 0L;
this.totalCpuLimit = getTotalCpuLimit();
pulsar.getLoadManagerExecutor().scheduleAtFixedRate(this::checkCpuLoad, 0, CPU_CHECK_MILLIS, TimeUnit.MILLISECONDS);
pulsar.getLoadManagerExecutor().scheduleAtFixedRate(this::calculateBrokerHostUsage, 0, hostUsageCheckInterval, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hostUsageCheckInterval is actually in minutes. Probably it'd be better to keep the time unit in the variable name to make it more explicit hostUsageCheckIntervalMin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure


private void checkCpuLoad() {
cpuUsageSum += systemBean.getSystemCpuLoad();
cpuUsageSum++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpuUsageCount++

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -441,8 +436,7 @@ public void testNamespaceBundleStats() {

@Test
public void testBrokerHostUsage() {
when(pulsar1.getConfiguration().getLoadBalancerHostUsageScriptPath()).thenReturn("usageScript");
BrokerHostUsage brokerUsage = new BrokerHostUsage(pulsar1);
BrokerHostUsage brokerUsage = new LinuxBrokerHostUsageImpl(pulsar1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should either fallback to generic host usage impl or disable the test on non-linux.

@sschepens sschepens force-pushed the host_usage_collection branch 8 times, most recently from 40886e9 to 4760b0f Compare November 16, 2016 14:28
@sschepens
Copy link
Contributor Author

@merlimat tests are green now, can you give this a last review?

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for fixing this!

@merlimat merlimat merged commit f999394 into apache:master Nov 16, 2016
sijie pushed a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
…r. Also consolidate all info in worker in FunctionRuntimeInfo (apache#88)
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
* Fix memory leak and optimize memory usage

*Motivation*

There is a memory leak in MessageRecordUtils.

*Modifications*

- Fix the memory leak in MessageRecordUtils
- Avoid memory allocation when serializing the response
- Optimize the reference counting for kafka request holding the payload reference

* fix checkstyle

* add mem conf into bin/kop

Co-authored-by: Jia Zhai <zhaijia@apache.org>
dlg99 added a commit to dlg99/pulsar that referenced this pull request Aug 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants