Architecture issues #41

Closed
vgololobov opened this Issue Mar 24, 2013 · 8 comments

@vgololobov

My question is about the Directory service and its purpose.

According to the wiki, to join a cluster I should specify the address of a directory node (which, as I understand it, should coordinate the connected nodes), but it just forwards all requests to the registry and is basically a redundant entity.

Moreover, the registry does not track node state (I'm talking about the redis adapter; I don't know about other adapters) and stores dead nodes indefinitely.

There are no timeouts, connection checks, etc. This means that if we try to connect to a dead node, which we can still find in the registry, we will wait for it forever.
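
To illustrate the kind of connection timeout being asked for, here is a minimal sketch. It is not DCell code and uses a plain TCP socket rather than DCell's own transport; the helper name `connect_with_timeout` and the 2-second default are assumptions made for the example. It relies only on `Socket.tcp` and its `connect_timeout:` option from Ruby's standard library.

```ruby
require "socket"

# Hypothetical helper, not part of DCell: try to open a TCP connection and
# give up after `timeout` seconds instead of waiting indefinitely.
# Returns the connected socket, or nil if the node could not be reached.
def connect_with_timeout(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout)
rescue SystemCallError, SocketError
  # Covers refused connections, unreachable hosts, and connect timeouts.
  nil
end
```

A caller could treat a `nil` result as "node is dead" and skip it, rather than blocking on an address that is still listed in the registry.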

What was the original idea?

What are the responsibilities of the Directory service?

Who is responsible for tracking node state?

If the Directory service should coordinate connected nodes, then we don't have to be connected to the registry; instead, we should forward all registry calls to the Directory service, right?

If different Directory nodes are connected to the same Registry, should nodes from one cluster be discoverable by nodes from another?

@spangenberg

I had the same question two weeks ago when I dived into the project.

I think this directory functionality existed in the gossip adapter; correct me if I'm wrong.

But currently there is only a stupid single-point-of-failure registry acting as a key-value store.

I'm digging into it and trying to create a new registry which lives on every node, sets up a routing table, checks node health, and so on...

Maybe we should talk together, sounds like you have similar needs.

Best
Daniel

@vgololobov

IMHO a routing table (more likely a DHT) is overcomplicated for this task and makes it hard to have clusters of nodes.

Currently each node sends a heartbeat message every 10 seconds, and having a Directory node that coordinates all connected nodes is a good idea.

If it goes down, the only thing you lose is the ability to discover the nodes connected to the cluster; all existing communications remain functional.

A couple of things I think would be great to add:

  • connection timeout
  • connection state tracking (using heartbeats)
  • ability to create different types of connections, like pub/sub (as claimed in the wiki)
  • load-balanced cluster of nodes
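
As a rough sketch of the "connection state tracking (using heartbeats)" item above: the class below is hypothetical and not taken from DCell. It records the last heartbeat seen per node and considers a node dead after `ttl` seconds of silence. The 30-second default (three missed 10-second heartbeats) and the injectable `clock` are assumptions for illustration.

```ruby
# Hypothetical in-memory tracker, not DCell code: records the last heartbeat
# seen from each node and treats a node as dead once it has been silent for
# longer than `ttl` seconds.
class HeartbeatTracker
  def initialize(ttl: 30, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock        # injectable so the example can use a fake clock
    @last_seen = {}
  end

  # Call whenever a heartbeat message arrives from `node_id`.
  def beat(node_id)
    @last_seen[node_id] = @clock.call
  end

  # True if a heartbeat from `node_id` arrived within the last `ttl` seconds.
  def alive?(node_id)
    seen = @last_seen[node_id]
    !seen.nil? && (@clock.call - seen) <= @ttl
  end

  # Forget nodes that have gone silent and return their ids.
  def prune
    dead = @last_seen.keys.reject { |id| alive?(id) }
    dead.each { |id| @last_seen.delete(id) }
    dead
  end

  # Ids of all nodes currently tracked.
  def nodes
    @last_seen.keys
  end
end

# Usage with a fake clock (seconds):
now = 0
tracker = HeartbeatTracker.new(ttl: 30, clock: -> { now })
tracker.beat("node-a")
now = 20
tracker.beat("node-b")
now = 40
tracker.alive?("node-a") # => false (silent for 40 seconds)
tracker.alive?("node-b") # => true  (silent for 20 seconds)
tracker.prune            # => ["node-a"]
```

A registry adapter built this way would stop returning nodes that have not been heard from, instead of keeping them forever.
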
@tarcieri
Member

The issue tracker is for bugs, not open-ended discussion. I guess I'm going to call this a dupe of #6?

Can you broach these issues on the mailing list? You seem to be missing quite a bit from the codebase as well: DCell most certainly has a heartbeat system for detecting downed nodes.

https://github.com/celluloid/dcell/blob/master/lib/dcell/node.rb#L108

@tarcieri closed this Mar 24, 2013
@vgololobov

This was not a discussion.
The issues are:

  • Directory is not functional at all (it duplicates registry calls)
  • Detection of down nodes is not functional; the only thing it does is
        Celluloid::Logger.warn "Communication with #{id} interrupted"

I only said that DCell has a heartbeat system but does not use it.

@tarcieri
Member

Okay, so this is just a dupe of #6 then, or what?

@vgololobov

No, the issue this ticket addresses is not down nodes.
It is about what to do with the Directory service and node clustering.

Either add functionality according to the wiki, or remove it as a duplicate of Registry.

@tarcieri
Member

Okay, I'm a bit confused because you're telling me a lot of different things in this ticket so I'm having trouble nailing down anything specific...

The issue is: the code in Directory merely delegates to the active Registry and is therefore unnecessary? Probably.
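
For readers following along, this is roughly what "merely delegates" looks like. The classes below are hypothetical stand-ins written for this illustration, not DCell's source; the names `MemoryRegistry` and `ForwardingDirectory` and their method names are assumptions.

```ruby
# Hypothetical stand-in for a registry adapter: a plain in-memory store
# mapping node ids to addresses.
class MemoryRegistry
  def initialize
    @nodes = {}
  end

  def get_node(id)
    @nodes[id]
  end

  def set_node(id, addr)
    @nodes[id] = addr
  end

  def nodes
    @nodes.keys
  end
end

# Hypothetical directory that only forwards to the registry. Every method is
# a one-line pass-through, so it adds no coordination of its own.
class ForwardingDirectory
  def initialize(registry)
    @registry = registry
  end

  def get(id)
    @registry.get_node(id)
  end

  def set(id, addr)
    @registry.set_node(id, addr)
  end

  def all
    @registry.nodes
  end
end
```

Anything a caller can do through such a directory it could do by talking to the registry directly, which is the redundancy this ticket points at.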

DCell uses heartbeats to determine the liveness of nodes. Every node is a Celluloid::FSM, and the heartbeat system triggers state transitions which mark the node as down.

When nodes are marked as down, existing DCell RPC calls are cancelled, and new ones will fail until the nodes are marked back up. If you are having trouble with this behavior, I suggest you open another issue.
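
The behavior described above can be pictured with a small state machine. This is a plain-Ruby sketch that does not use Celluloid::FSM and is not DCell's implementation; the class name, the state names `:connected` and `:partitioned`, and the `call` method standing in for an RPC are all assumptions for illustration.

```ruby
# Hypothetical sketch in plain Ruby (it does not use Celluloid::FSM).
class NodeState
  class NodeDownError < StandardError; end

  # Allowed transitions: current state => states it may move to.
  TRANSITIONS = {
    connected:   [:partitioned],
    partitioned: [:connected]
  }.freeze

  attr_reader :state

  def initialize
    @state = :connected
  end

  # Move to state `to`, rejecting transitions that are not allowed.
  def transition(to)
    unless TRANSITIONS.fetch(@state).include?(to)
      raise ArgumentError, "cannot go from #{@state} to #{to}"
    end
    @state = to
  end

  # Heartbeat events drive the state machine.
  def heartbeat_missed
    transition(:partitioned) if @state == :connected
  end

  def heartbeat_received
    transition(:connected) if @state == :partitioned
  end

  # Stand-in for an RPC: fails fast while the node is marked down.
  def call
    raise NodeDownError, "node is down" unless @state == :connected
    yield
  end
end
```

With this shape, a missed heartbeat makes subsequent calls raise immediately, and a later heartbeat restores normal operation.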

@vgololobov

"The issue is: the code in Directory merely delegates to the active Registry and is therefore unnecessary?"

This is the point of this ticket.

Remove it, or rewrite it to handle clustering as described in the wiki.
