Discuss: BEP-14 #40
Hey guys, so I drafted a class using some of your code. The implementation follows Alberto's suggested implementation and optimization: basically I'm prioritizing nodes by low latency using `heapq`, keeping the heap sorted by the elapsed time of the current request, and initializing it (`load_latency`) with the elapsed time of a `HEAD` request to every node in the list.

```python
from collections import namedtuple
from heapq import heappush, heappop

from requests import Session

from .exceptions import HTTP_EXCEPTIONS, TransportError

HttpResponse = namedtuple('HttpResponse', ('status_code', 'headers', 'data'))

# Penalty (in milliseconds) added to a node's priority when a request fails
DELAY = 1000 * 10


class LoadBalancer:
    """Connection load balancer."""

    def __init__(self, *, nodes, headers=None):
        self.heap = []
        self.nodes = []
        self.tries = 0
        self.session = Session()
        self.init_nodes(nodes, headers)
        self.load_latency()

    def init_nodes(self, nodes, headers=None):
        if nodes and isinstance(nodes[0], dict):
            self.nodes = nodes
        else:
            for url in nodes:
                self.nodes.append({'endpoint': url, 'headers': headers})

    def load_latency(self):
        # Initialize every node's priority with the elapsed time of a
        # HEAD request. The endpoint string is a tie-breaker so heapq
        # never has to compare two node dicts.
        for node in self.nodes:
            response = self.session.request(
                method='HEAD',
                url=node['endpoint'],
                headers=node['headers'],
            )
            millis = response.elapsed.microseconds / 1000.0
            heappush(self.heap, (millis, node['endpoint'], node))

    def load_node(self, method, path=None, json=None,
                  params=None, headers=None, **kwargs):
        while True:
            time, _, node = heappop(self.heap)
            response = self.session.request(
                method=method,
                url=node['endpoint'],
                json=json,
                params=params,
                headers=headers,
                **kwargs
            )
            text = response.text
            if 200 <= response.status_code < 300:
                # Update the node with its new elapsed time
                # and return the actual response
                self.tries = 0
                millis = response.elapsed.microseconds / 1000.0
                heappush(self.heap, (millis, node['endpoint'], node))
                try:
                    json = response.json()
                except ValueError:
                    json = None
                data = json if json is not None else text
                return HttpResponse(response.status_code,
                                    response.headers, data)
            elif self.tries > 2:
                # We are done trying; raise an error
                self.tries = 0
                exc_cls = HTTP_EXCEPTIONS.get(response.status_code,
                                              TransportError)
                raise exc_cls(response.status_code, text, json)
            else:
                # Try again, but don't retry this node for a DELAY while
                self.tries += 1
                heappush(self.heap, (time + DELAY, node['endpoint'], node))
```

Please let me know your opinions.
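To see the heap bookkeeping in isolation, here is a minimal sketch with no network calls; the endpoint names and latencies are fabricated, but the tuple shape matches the draft (priority first, so `heapq` orders nodes by latency):

```python
from heapq import heappush, heappop

# Hypothetical measured latencies in milliseconds (endpoint, latency)
measurements = [('http://node-a', 42.0), ('http://node-b', 17.5),
                ('http://node-c', 80.3)]

heap = []
for endpoint, millis in measurements:
    # Priority (latency) first; the endpoint string breaks ties
    heappush(heap, (millis, endpoint))

# The node with the lowest latency is always popped first
best_latency, best_endpoint = heappop(heap)
print(best_endpoint)   # http://node-b
```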
Hi Camilo, thanks for the preview of the Python code; I really like the idea of using `heapq`. If you want to test your implementation on a live network, you can create an account on our BigchainDB Testnet. Once you sign up for the test network, you'll get a token you have to add to the headers of your requests. The test network runs four nodes, all on Azure, West Europe, so they might have similar latency.
I have a couple of questions related to this implementation:
Hi, thanks! So answering your questions:

> At what time do you expect to run the `load_latency` method?

> Latency changes over time; what if a node that was slow in responding to `load_latency` was just temporarily under heavy load?

```python
def heap_algo(self):
    time, best_node = heappop(self.heap)
    response = request(best_node)
    if response is not ok:
        # try again; don't retry this node for a DELAY while
        heappush(self.heap, (time + DELAY, best_node))
    else:
        # update the node with its new elapsed time
        time = response.elapsed.microseconds / 1000.0
        heappush(self.heap, (time, best_node))
```

`heappop` retrieves the best node from the heap, and after the request we put it back (`heappush`) with its updated latency, so latency stays up to date for every node while the algorithm runs.
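The effect of the `DELAY` penalty can be seen in isolation. In this sketch (fabricated latencies, no real requests), the failing best node is pushed back with the penalty added, so the next pop picks a different node:

```python
from heapq import heappush, heappop

DELAY = 1000 * 10  # penalty in milliseconds, as in the draft

heap = []
heappush(heap, (15.0, 'http://node-a'))
heappush(heap, (60.0, 'http://node-b'))

# node-a wins the first pop, but suppose its request fails...
time, node = heappop(heap)
heappush(heap, (time + DELAY, node))  # ...so it is penalized

# node-b is now preferred until node-a's penalty is beaten
time, node = heappop(heap)
print(node)   # http://node-b
```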
Also, following are the updated method and some new ones for the initial code, which should fix the incorrect logic you mentioned during the call. With this code we keep track of the number of failed tries per node, so after we have tried every node 3 times we raise an error: "All nodes are unreachable".

```python
def __init__(self, *, nodes, headers=None):
    # ....
    self.tries_per_node = {}
    self.init_nodes(nodes)
    self.init_tries_per_node()

def init_tries_per_node(self):
    for node in self.nodes:
        self.tries_per_node[node['endpoint']] = 0

def update_tries(self, node):
    self.tries_per_node[node['endpoint']] += 1

def enough_tries(self):
    for endpoint in self.tries_per_node:
        if self.tries_per_node[endpoint] < 3:
            return False
    return True

def request(self, method, path=None, json=None,
            params=None, headers=None, **kwargs):
    while not self.enough_tries():
        node_response = self.load_node(method, path=path, json=json,
                                       params=params, headers=headers,
                                       **kwargs)
        if node_response is not None:
            return node_response
    raise TransportError('All nodes are unreachable')

def load_node(self, method, path=None, json=None,
              params=None, headers=None, **kwargs):
    time, _, node = heappop(self.heap)
    response = self.session.request(
        method=method,
        url=node['endpoint'],
        json=json,
        params=params,
        headers=headers,
        **kwargs
    )
    text = response.text
    if not (200 <= response.status_code < 300):
        # Try again, but don't retry this node for a DELAY while
        heappush(self.heap, (time + DELAY, node['endpoint'], node))
        self.update_tries(node)
        return None
    else:
        # Update the node with its new elapsed time
        # and return the actual response
        time = response.elapsed.microseconds / 1000.0
        heappush(self.heap, (time, node['endpoint'], node))
        try:
            json = response.json()
        except ValueError:
            json = None
        data = json if json is not None else text
        return HttpResponse(response.status_code,
                            response.headers, data)
```

Of course this is not final code, just to correct the draft and move forward. Let me know any further questions :)
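The per-node retry bookkeeping can be exercised on its own. This standalone sketch (node list invented) mirrors the counters above and shows that `enough_tries` only turns true after every node has failed three times:

```python
class TriesTracker:
    """Standalone version of the per-node retry counters."""

    def __init__(self, endpoints):
        self.tries_per_node = {endpoint: 0 for endpoint in endpoints}

    def update_tries(self, endpoint):
        self.tries_per_node[endpoint] += 1

    def enough_tries(self):
        # True only once every node has failed at least 3 times
        return all(count >= 3 for count in self.tries_per_node.values())


tracker = TriesTracker(['http://node-a', 'http://node-b'])
for _ in range(3):
    tracker.update_tries('http://node-a')
print(tracker.enough_tries())   # False: node-b has not failed yet
for _ in range(3):
    tracker.update_tries('http://node-b')
print(tracker.enough_tries())   # True: every node failed 3 times
```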
I have some further questions about this implementation.
This means that a user of the driver might experience a delay proportional to the number of nodes passed at initialization, because of the blocking, consecutive requests used to estimate the round-trip time. In the worst case, where one or more nodes are unavailable, the delay the user might experience is as high as the socket timeout times the number of unavailable nodes.

There is another detail that might not be obvious (it was not obvious to me, at least): measuring the time for a `HEAD` request … My suggestion to fix this would be to not use …
We have 4 nodes: A, B, C, D. After some time, we have something like … My suggestion to fix it would be to add some logic to …

There is also another thing I didn't consider before: different endpoints have different response times, depending on the amount of computation the endpoint requires. Some endpoints are really fast (e.g. getting a transaction), while others can be slower (e.g. pushing a transaction and waiting until it's stored in a block), so we might end up putting response times for really different jobs in the heap. Given that, I'm wondering if we should proceed with the original, simpler proposal. Thoughts?
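One possible shape for the "decreasing time constant" idea referenced later in the thread, entirely as my own sketch (invented latencies, a logical clock instead of wall time, names not from the BEP): the longer an entry sits unused in the heap, the lower its effective priority, so a node that was penalized or slow once is eventually retried:

```python
# Assumed illustration, not the original proposal: each entry remembers
# when it was recorded, and its effective priority decreases with age.
DECAY = 2.0  # priority discount per clock tick (made-up constant)

def effective_priority(latency, recorded_at, now):
    return latency - DECAY * (now - recorded_at)

def pick(now):
    # node-a was penalized at tick 0 and has sat untouched since;
    # node-b is in active use, so its entry was just refreshed.
    entries = [(10_015.0, 'http://node-a', 0),
               (60.0, 'http://node-b', now)]
    return min(entries,
               key=lambda e: effective_priority(e[0], e[2], now))[1]

print(pick(now=100))    # http://node-b: the penalty still dominates
print(pick(now=6000))   # http://node-a: the penalty has decayed away
```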
Sorry for the late response. I agree with almost everything, just a couple of different thoughts:

Yeah, true, but that's assuming that hitting the same node every time is never going to change its latency, which could contradict your next argument.

With that in mind, it suggests that we are going to use all the endpoints on all the nodes, and the chances of hitting another node (not our best node) increase. But I agree with it; maybe this suggests we should calculate latency per endpoint and modify the queue priority logic.

Another idea on top of your suggestion: we don't hit the same endpoint more than, say, 2-3 times consecutively; if we reach that, we use the decreasing time constant you suggested.

On the other hand, yeah, I'll implement the round-robin approach first, and after that I'll compare it with this method. Thank you!
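A minimal sketch of the round-robin approach mentioned above (my own illustration; the class name and endpoint URLs are invented, not code from the BEP):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out endpoints in fixed rotation, ignoring latency."""

    def __init__(self, endpoints):
        self._endpoints = cycle(endpoints)

    def next_node(self):
        # Each call returns the next endpoint, wrapping around
        return next(self._endpoints)


balancer = RoundRobinBalancer(['http://node-a', 'http://node-b',
                               'http://node-c'])
picked = [balancer.next_node() for _ in range(4)]
print(picked)
# ['http://node-a', 'http://node-b', 'http://node-c', 'http://node-a']
```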
Hey there guys, I just added a PR for BEP-14! Please let me know the next steps and any suggestions.
The purpose of this issue is to serve as a place to openly discuss the specifics of BEP-14.