[PR-1] Node level metrics [latency and data rate] measurement on Kernel Servers #794
Conversation
Thanks @VenuReddy2103. Let me know when this is ready for review.
Sorry @VenuReddy2103, I was busy on other things today. Will try to review tomorrow.
@VenuReddy2103 I tried to review this today, but it's essentially impossible. Please self-review this, explain how you would like me to review it, and explicitly request review (with a written comment) when it is ready for my review. In the meantime I'm removing myself from the reviewers list.
@VenuReddy2103 Still no reply to #794 (comment)?
Force-pushed from 46e6632 to 791f3fe.
Force-pushed from 791f3fe to e221e23.
The data rate calculation was not correct; have fixed it. Will test further to find and fix issues, and will notify when this is ready for review.
@@ -36,8 +36,9 @@
 /* TODO: Need to check if size of data need to be increased to get the better data transfer rates. Sometimes, empty
    heartbeats are taking longer time than heartbeat with 2k arbitrary data. Thus, resulting in negative data transfer
    rate */
-private static final int RANDOM_DATA_SIZE = 2 * 1024;
-private static final byte[] randomBytes = new byte[RANDOM_DATA_SIZE];
+private static final int DATA_SIZE_IN_MB = 10; // size in MegaBytes
You still need to explain here why 10MB has been chosen.
Have modified it to start with 1 KB of data and increase by a step size if the data transfer time is not significant compared to the latency.
@@ -267,13 +268,10 @@ public void measureServerMetrics() {
 /* Measure data transfer rate by sending some arbitrary data */
 server.receiveHeartBeat(randomBytes);
 long t3 = System.nanoTime();
-metric.latency = (t2 - t1) / 2;
+metric.latency = (t2 - t1); // RTT
Latency needs to be 1-way latency, so the original code was correct.
Latency is not implicitly equal to half of RTT, because delay may be asymmetrical between two given endpoints. Since we make RPC calls to the destination and then wait for an acknowledgement/result to come back before making further calls, RTT may be the better measure?
Was searching about it and found a few links treating it that way:
https://www.sas.co.uk/blog/what-is-network-latency-how-do-you-use-a-latency-calculator-to-calculate-throughput
Let me know your opinion.
-metric.rate =
-        ((RANDOM_DATA_SIZE * 8 * 1000.0 * 1000.0)
-                / (((t3 - t2) - (t2 - t1)) * 1024.0 * 1024.0));
+metric.rate = (8.0 / DATA_SIZE_IN_MB) / ((t3 - t2) / (1000.0 * 1000.0 * 1000.0));
This is wrong, or at best very confusing. You want bytes per second.
So:
DATA_SIZE_IN_BYTES/(t3 - t2)/TimeUnit.SECONDS.toNanos(1);
?
Of course, beware of integer arithmetic.
The unit is Mbps (megabits/sec).
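For reference, a minimal sketch of that computation in the Mbps unit the author states, done entirely in double precision to sidestep the integer-arithmetic trap the reviewer warns about. The method wrapper and the `dataSizeInBytes` parameter name are illustrative, not the PR's actual code:

```java
import java.util.concurrent.TimeUnit;

/* Illustrative sketch only, not the PR's implementation. */
public final class RateSample {
    /** Transfer rate in Mbps for dataSizeInBytes sent between t2 and t3 (nanoseconds). */
    static double rateMbps(long dataSizeInBytes, long t2, long t3) {
        // Promote to double before dividing so nothing truncates to zero.
        double seconds = (t3 - t2) / (double) TimeUnit.SECONDS.toNanos(1);
        double megabits = dataSizeInBytes * 8.0 / (1000.0 * 1000.0);
        return megabits / seconds;
    }
}
```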
Force-pushed from e221e23 to 6759829.
Still working on this PR. Progress so far -
Yet to do:
Force-pushed from 67185e2 to 77bea4c.
The metrics measurement process is independent for each server, and the frequency of measurement also differs per server. The following mechanism is used:
   between client and server, took the round trip time (RTT) as latency */

/* Make an empty heartbeat to ensure session is established and cached before measurement */
nit: This is a bit messy, but I guess necessary.
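For context, the timing pattern these diff fragments come from, as a sketch. It assumes an empty-payload `receiveHeartBeat` overload (implied by the "empty heartbeat" comment above) and the PR's `KernelServer`/`KernelServerInfo`/`NodeMetric` types; the method wrapper itself is illustrative:

```java
/* Sketch only: assumes an empty-payload receiveHeartBeat overload and the PR's types. */
void measureOnce(KernelServer server, KernelServerInfo serverInfo, NodeMetric metric)
        throws java.rmi.RemoteException {
    /* Empty heartbeat so session setup cost is not charged to the first sample */
    server.receiveHeartBeat();

    long t1 = System.nanoTime();
    server.receiveHeartBeat();                // empty heartbeat: RTT sample
    long t2 = System.nanoTime();
    server.receiveHeartBeat(serverInfo.data); // heartbeat carrying arbitrary data
    long t3 = System.nanoTime();

    metric.latency = t2 - t1;                 // RTT, as in the diff above
    // (t3 - t2) is the payload transfer time used for the rate sample
}
```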
/* Measure data transfer rate by sending some arbitrary data */
server.receiveHeartBeat(serverInfo.data);
long t3 = System.nanoTime();
if (t3 - t2 < t2 - t1) {
Please add a comment to describe what you're doing in this block, e.g. "data too small relative to line latency - try again immediately with one more step of data".
Added a comment
server.receiveHeartBeat(serverInfo.data);
long t3 = System.nanoTime();
if (t3 - t2 < t2 - t1) {
    period = MIN_METRIC_POLL_PERIOD_MS;
With the current values and logic, this is not going to work very well.
For example, for a 10 Gbit Ethernet with, say, 50 ms latency (a fairly common link between data centers), you will need to send roughly 1 GByte of data (which will take about 1 second) for the round-trip latency to be about 10% of the total transfer time (so that the measurement is off by less than about 10%).
Because you are only starting with 1 KByte and incrementing by 1 KByte every 1 second, it will take a million seconds to get up to 1 GByte. That's about 12 days, and in the process you will waste many, many GBytes of transferred data getting there.
I would suggest that you come up with a better plan. One suggestion might be to double the data size every time (and also perhaps start with a bigger number than 1 KByte).
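To make the arithmetic explicit (assuming the 10x rule of thumb implied by the 10% error target):

```latex
t_{xfer} \ge 10 \cdot t_{RTT} = 10 \times 50\,\mathrm{ms} = 0.5\,\mathrm{s}
\quad\Rightarrow\quad
\mathrm{data} = 0.5\,\mathrm{s} \times 10\,\mathrm{Gbit/s} = 5\,\mathrm{Gbit} \approx 625\,\mathrm{MByte}
```

which is the "about 1 GByte" figure quoted above.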
PS I think slow 3G links are a few hundred Kbit/s to 1 Mbit/s, with latencies around 100 ms.
https://www.ofcom.org.uk/about-ofcom/latest/media/media-releases/2014/3g-4g-bb-speeds
So the smallest amount of data you're ever likely to need to send (using similar assumptions to the above) is around 128 KBytes. You may as well start just below that (say 32 KBytes) and keep doubling until your data transmission time is significantly greater than your basic line latency (10x in the above example).
In that case, to get from 2^7 KBytes (128 KB) to 2^20 KBytes (1 GB) would take 20 − 7 = 13 attempts. That seems fine. Way better than a million above.
Agreed. Will double the amount of data until the data transmission time is greater than the line latency.
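A sketch of the agreed policy; the constants are illustrative, derived from the figures discussed above rather than taken from the PR:

```java
/* Illustrative sketch of the agreed probe-size policy. */
public final class ProbeSizePolicy {
    static final int MIN_PROBE_BYTES = 32 * 1024; // reviewer's suggested starting size
    static final int MAX_PROBE_BYTES = 1 << 30;   // ~1 GByte ceiling from the 10 Gbit example

    /** Doubles the payload until its transfer time clearly exceeds the RTT. */
    static int nextProbeBytes(int current, long transferNanos, long rttNanos) {
        if (transferNanos < rttNanos && current < MAX_PROBE_BYTES) {
            return current << 1; // sample too noisy: double and retry immediately
        }
        return current; // payload large enough for a meaningful rate sample
    }
}
```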
        >= KernelServerInfo.MIN_STABLE_DATA_RATE_TIMES) {
    /* Data rates are consistent. Double the poll period */
    period = period << 1;
    /* Limit max poll period to 512 sec */
nit: Limit the max poll period to MAX_METRIC_POLL_PERIOD_MS
Updated the comment
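The backoff under discussion, sketched with the constants from the diff below (MAX was 512 s at this point in the review; it is reduced to 128 s later in the thread). The method wrapper is illustrative:

```java
/* Illustrative sketch of the poll-period backoff described above. */
public final class PollPeriodPolicy {
    static final int MIN_METRIC_POLL_PERIOD_MS = 1000;
    static final int MAX_METRIC_POLL_PERIOD_MS = 512000;

    static int nextPeriodMs(int periodMs, boolean dataRateStable) {
        if (!dataRateStable) {
            // The rate moved: fall back to aggressive polling to track the change.
            return MIN_METRIC_POLL_PERIOD_MS;
        }
        // Consecutive stable samples: double the period, capped at the max.
        return Math.min(periodMs << 1, MAX_METRIC_POLL_PERIOD_MS);
    }
}
```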
@@ -28,41 +34,180 @@
 * @author iyzhang
 */
public class KernelClient {
    private static final int MIN_METRIC_POLL_PERIOD_MS = 1000; /* Minimum metrics poll period */
    private static final int MAX_METRIC_POLL_PERIOD_MS = 512000; /* Maximum metrics poll period */
I think this is probably too long? It will take almost 10 minutes to detect a faster or slower link? I think we should aim for a faster response than that. Either way, this number needs to be properly considered, and you need to document here why you chose the given number.
Since this heartbeat between kernel servers exists just to measure latency and transfer rate (i.e., not liveness), once we get stable samples we can gradually reduce the frequency of these heartbeats.
If the remote kernel server goes down for some reason, any RPC call to that particular server fails anyway; the caller then contacts the respective DM group policy to get an available server and uses it. The KernelServerInfo gets removed from this server's servers map after MAX_FAILED_HEARTBEATS.
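A hypothetical sketch of that eviction path. Only `MAX_FAILED_HEARTBEATS` and the `servers` map are named in the thread; the threshold value, the failure counter, and the method are assumptions:

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/* Hypothetical sketch, not the PR's actual code. */
public final class HeartbeatEviction {
    static final int MAX_FAILED_HEARTBEATS = 3; // assumed threshold

    final Map<InetSocketAddress, Integer> failedHeartbeats = new ConcurrentHashMap<>();
    final Map<InetSocketAddress, Object> servers = new ConcurrentHashMap<>();

    void onHeartbeatFailure(InetSocketAddress host) {
        int failures = failedHeartbeats.merge(host, 1, Integer::sum);
        if (failures >= MAX_FAILED_HEARTBEATS) {
            servers.remove(host); // stop polling a server that stays unreachable
            failedHeartbeats.remove(host);
        }
    }
}
```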
I think you missed my point. If a node moves between networks of different speeds (e.g. from wifi to 3G or vice versa) or if its link simply gets congested, then it will take until the next metric poll period for that change to be detected. If the metric poll period is 512 seconds, then that's almost 10 minutes before the change gets detected. That's way too long.
Got it. Have reduced it to 128 seconds now. From the test results, the amount of data sent in heartbeats to measure the data rate is negligible compared to the link capacity, so I believe 128 s (~2 min) is OK as an upper bound.
}

serverInfo.stableDataRateTimes++;
metric.latency = (t2 - t1); // RTT
Surely you should set this irrespective of whether the sample was ignored above. The measured latency is always correct. The measured throughput is only correct in some cases.
Actually the node metric object holds both the latency and the data rate, so we ignore the latency when the data rate cannot be calculated; otherwise we would have to send a node metric object (sample) to OMS with a valid latency but a data rate of 0.
Since this case can only occur a few initial times, until the data size is significant enough to get data transfer time > latency, we ignore the sample.
KernelServer server = serverInfo.remoteRef;
int dataLength = serverInfo.data.len;
NodeMetric metric = serverInfo.metric;
int period = serverInfo.metricPollPeriod;
I think you'll find that if you remove this alias, your logic will get a lot simpler. Just use serverInfo.metricPollPeriod directly? Similar argument for above 3 aliases, perhaps.
Modified
@@ -74,4 +79,47 @@ private boolean matchRequirements(List<Requirement> requirements) {
    }
    return !requirements.isEmpty();
}

private void writeObject(java.io.ObjectOutputStream s) throws java.io.IOException {
I don't understand why this method is here and so weird. Please comment to explain why you didn't do the standard stuff:
Added header and comments in it.
No you didn't.
 * @param data
 * @throws RemoteException
 */
void receiveHeartBeat(RandomData data) throws RemoteException;
It would be good to add comments explaining why it doesn't just take a String as a parameter. Superficially it looks quite silly to create a whole new data type just for this purpose. I understand you're probably trying to make the serialization and deserialization more efficient or something, but I think there's probably a much better way (e.g. by passing in a String.substring() created on the fly).
Added comments for RandomData class.
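For illustration, one shape such a type could take if the goal is to avoid allocating and copying a fresh payload per heartbeat. This is a guess at the intent, not the PR's actual RandomData; only the `len` field is named in the thread, and the buffer size is assumed:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/* Speculative sketch of a payload carrier with custom serialization. */
public class RandomData implements Serializable {
    private static final int MAX_LEN = 1 << 20;             // assumed 1 MB ceiling
    private static final byte[] SHARED = new byte[MAX_LEN]; // contents are irrelevant
    int len;

    /* Write only the length plus a slice of the shared buffer,
     * so the sender never copies a per-heartbeat array. */
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(len);
        out.write(SHARED, 0, len);
    }

    private void readObject(ObjectInputStream in) throws IOException {
        len = in.readInt();
        in.readFully(new byte[len]); // force the receiver to consume the payload
    }
}
```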
 * @param data
 */
@Override
public void receiveHeartBeat(RandomData data) {}
Come to think of it, the compiler/JIT might optimize this data away because it's never used. Might be worth checking that. I'm not sure what the best solution is if it is being optimized away. You might need to access it somehow, without using up too many CPU cycles.
Have checked it. The call is not optimized away, probably because it is an RPC, and the data is also deserialized on the remote end.
See inline comments for requested changes.
Force-pushed from 248871c to f71b126.
Link speed between the two systems used for testing: 1000 Mbps. KS1 (192.168.59.2) is running on system1 along with OMS; KS2 (192.168.59.4) is running on system2. The data rate unit is Bytes/sec. PFA the test logs below:
Force-pushed from f71b126 to f4841a7.
Force-pushed from f4841a7 to c7a4e5d.
@@ -28,41 +38,256 @@
 * @author iyzhang
 */
public class KernelClient {
    private static final int MIN_METRIC_POLL_PERIOD_MS = 1000; /* Minimum metrics poll period */
    private static final int MAX_METRIC_POLL_PERIOD_MS = 128000; /* Maximum metrics poll period */
You still don't explain here in comments why these numbers were chosen. They seem fairly arbitrary.
Put yourself in the position of someone coming after you to change the code. They will ask themselves, "can I change this number, to what should I change it, and why".
Force-pushed from 51f5f4b to 177a3ff.
…ts local kernel server is not registered to OMS. And also, kernel servers do metric measurement to the kernel server collocated within OMS
Have fixed the APP client not exiting issue. An APP client creates a dummy local kernel server, which is meant to route RPC calls through it to the remote kernel servers where the MicroServices it interacts with reside. But that local kernel server is not registered to OMS and does not send heartbeats to OMS, yet we were measuring node metrics from that dummy local kernel server to all the remaining remote kernel servers. In fact, such APP clients do not allow deployment of MicroServices on them (and also do not participate in automatic migration of MicroServices); hence, they shouldn't measure node metrics.
Fundmover app logs:
HanksTodo app logs: Have 4 kernel servers with 2 servers in each region.
KVStore app logs: Have 4 kernel servers with 2 servers in each region.
OK, to avoid further delays I'm going to merge this PR and make the proposed improvements in follow-up PRs.
This PR measures the latency and data rate from each node to every other available node, and reports these metrics to OMS in the existing heartbeat between kernel server and OMS. Handling of the received metrics on OMS is not part of this PR.
This PR is the same as the old PR #742, just raised from a new fork; the old PR is being closed.