-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement BrokerHostUsage using java #88
Conversation
CLA is valid! |
0c564b7
to
3533040
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good, added few comments
loadBalancerHostUsageScriptPath= | ||
|
||
# Frequency of sar report to collect | ||
# Frequency of report to collect | ||
loadBalancerHostUsageCheckIntervalMinutes=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change this setting to specify the frequency of the stats collections, and since we don't rely on sar
anymore, we could specify intervals of < 1min
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it really make sense to have intervals of < 1min? Would we be updating these reports to ZK everytime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not super-sure it would be useful. Anyway this would not necessarily be controlling the frequency of update in ZK, just the frequency of collection, and the time window on which to compute the average rates.
Even now, updating ZK is skipped if there are no significant changes in the host load.
@@ -32,15 +41,26 @@ | |||
public class BrokerHostUsage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it would be worth to have a base interface and instantiate the right implementation at startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What interface do we plan on having? Just exposing SystemResourceUsage getBrokerHostUsage()
would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Even some of the OS-indipendent methods (eg: used mem, processors count) could be put in a base abstract class.
usage.setCpu(new ResourceUsage(0d, totalCpuLimit)); | ||
} else { | ||
double elapsedSeconds = (now - lastCollection) / 1000d; | ||
Double nicUsageTx = (totalNicUsageTx - lastTotalNicUsageTx) / elapsedSeconds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should use double
and long
as much as possible, since every Double
and Long
will trigger an allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I missed these
if (exitCode != 0) { | ||
LOG.warn("Process exited with non-zero exit code - [{}], stderr - [{}] ", exitCode, writer.toString()); | ||
throw new IOException(writer.toString()); | ||
String[] words = Files.lines(Paths.get("/proc/stat")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit terse, we should include a comment with an example of the file line. 😄
The computation looks correct though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
||
return new CpuStat(values.sum(), values.limit(3).sum()); | ||
} catch (IOException e) { | ||
return new CpuStat(0L, 0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not blindly ignore exception, otherwise it would be very difficult to discover any problem.
.map(path -> path.getFileName().toString()) | ||
.collect(Collectors.toList()); | ||
} catch (IOException e) { | ||
throw new RuntimeException(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, we should either propagate the IOException or at least print a warning log message if we can't read the information.
} catch (IOException e) { | ||
return 0d; | ||
} | ||
}).sum() * 1024 / 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should report kbits/s
for NIC speed and traffic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok!
@@ -71,25 +91,152 @@ public BrokerHostUsage(PulsarService pulsar) { | |||
* | |||
* @throws IOException | |||
*/ | |||
public String getBrokerHostUsage() throws IOException { | |||
StringWriter writer = new StringWriter(); | |||
public synchronized SystemResourceUsage getBrokerHostUsage() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refresh should not be done every time this method is called, but rather at regular intervals (eg: every 1min)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If calculations are done every 1min, but the data es retreived on a longer period, does it really make sense to make calculations at shorter periods? Would we always return the average usage of the last minute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation runs the script everytime this method is called.
How often is this method actually called?
This method would return the average usage between two calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be consistent on what data we report. Eg: all the averages should be computed on the same time window, across all the brokers, to make a meaningful comparison.
Now, if a broker see that its own load has not changed from what it has already reported on ZK (+- x%), it will not update ZK (for a certain time), but when it does, it should report the same avg over same time window.
From an API perspective, the semantic of BrokerHostUsage
should not change if we call the same method twice in short time.
In current case it's maybe true that the getBrokerHostUsage()
is called exactly every 1min, but sometimes happens that some new component that need the same information will start calling the same method... and that will start giving fun results.. (we've seen some of these things :) )
Because of that, I was thinking to use an executor and schedule the task to update the rates and the getBrokerHostUsage()
will just return the latest computed rates.
@@ -4,4 +4,4 @@ | |||
# Since the original script is system dependent and also requires /home/y location, this script is the dummy one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this script as well.
1116970
to
661c481
Compare
@merlimat do these last changes seem ok? I've reused |
661c481
to
857082a
Compare
} | ||
|
||
private class CpuStat { | ||
private Long totalTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use 2 long
here
@@ -177,7 +177,7 @@ public SimpleLoadManagerImpl(PulsarService pulsar) { | |||
placementStrategy = new WRRPlacementStrategy(); | |||
lastLoadReport = new LoadReport(pulsar.getWebServiceAddress(), pulsar.getWebServiceAddressTls(), | |||
pulsar.getBrokerServiceUrl(), pulsar.getBrokerServiceUrlTls()); | |||
brokerHostUsage = new BrokerHostUsage(pulsar); | |||
brokerHostUsage = new LinuxBrokerHostUsageImpl(pulsar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will generate errors when running on MacOS. We should have a basic factory that returns a generic host usage implementation when not running on Linux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, we don't need precise load information for MacOS, since it's not meant to be used for prod.. just not throwing exception or printing error logs, at least for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to also use OperatingSystemMXBean
for cpu usage collection, it exposes getSystemCpuLoad()
which the recent cpu usage, however, the period considered is not documented.
We could probe this method once every 1 or 2 seconds probably and take the average.
This would make load calculation more portable.
Nic usage however, would remain linux-specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use the mbean based CPU usage for the "generic os" implementation and keep using /proc/cpustat on Linux.
For mac, we could omit he NIC traffic until we find a reasonable way to retrieve that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use the mbean based CPU usage for the "generic os" implementation and keep using /proc/cpustat on Linux.
For mac, we could omit he NIC traffic until we find a reasonable way to retrieve that.
return new CpuStat(total, total - idle); | ||
} catch (IOException e) { | ||
LOG.error("Failed to read CPU usage from /proc/stat", e); | ||
return new CpuStat(0L, 0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what kind of usage % would we get when we fallback to (0, 0)
?
Would it look like 0% or 100% ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'd get "divide by 0 error" at line 88, since the CPU time diff would be 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, i'll return null instead, as there's no other easy way around this, returning the same values will always end up in dividing by zero.
857082a
to
7e34b3a
Compare
7e34b3a
to
44ef30b
Compare
@merlimat pushed a generic implementation, could you review this again please? |
44ef30b
to
e550068
Compare
this.lastCollection = 0L; | ||
this.totalCpuLimit = getTotalCpuLimit(); | ||
pulsar.getLoadManagerExecutor().scheduleAtFixedRate(this::checkCpuLoad, 0, CPU_CHECK_MILLIS, TimeUnit.MILLISECONDS); | ||
pulsar.getLoadManagerExecutor().scheduleAtFixedRate(this::calculateBrokerHostUsage, 0, hostUsageCheckInterval, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hostUsageCheckInterval
is actually in minutes. Probably it'd be better to keep the time unit in the variable name to make it more explicit hostUsageCheckIntervalMin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
|
||
private void checkCpuLoad() { | ||
cpuUsageSum += systemBean.getSystemCpuLoad(); | ||
cpuUsageSum++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpuUsageCount++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
@@ -441,8 +436,7 @@ public void testNamespaceBundleStats() { | |||
|
|||
@Test | |||
public void testBrokerHostUsage() { | |||
when(pulsar1.getConfiguration().getLoadBalancerHostUsageScriptPath()).thenReturn("usageScript"); | |||
BrokerHostUsage brokerUsage = new BrokerHostUsage(pulsar1); | |||
BrokerHostUsage brokerUsage = new LinuxBrokerHostUsageImpl(pulsar1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should either fallback to generic host usage impl or disable the test on non-linux.
40886e9
to
4760b0f
Compare
4760b0f
to
2443f75
Compare
@merlimat tests are green now, can you give this a last review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for fixing this!
…r. Also consolidate all info in worker in FunctionRuntimeInfo (apache#88)
* Fix memory leak and optimize memory usage *Motivation* There is a memory leak in MessageRecordUtils. *Modifications* - Fix the memory leak in MessageRecordUtils - Avoid memory allocation when serializing the response - Optimize the reference counting for kafka request holding the payload reference * fix checkstyle * add mem conf into bin/kop Co-authored-by: Jia Zhai <zhaijia@apache.org>
Motivation
Issue #81
I've made a first go a this, I believe it's working ok, please review 😊
Modifications
I've removed support for usage script and made
BrokerHostUsage
calculate metrics from system data.Current implementation only supports Linux, we should probably find a way to implement the same functionallity for other OS.
Result
Broker can collect usage metrics without need for a script.