-
Notifications
You must be signed in to change notification settings - Fork 476
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
2 changed files
with
761 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,283 @@ | ||
$Id: DESIGN,v 1.1 2006/05/20 21:42:13 belaban Exp $ | ||
|
||
|
||
|
||
|
||
|
||
Design Issues | ||
============= | ||
|
||
|
||
|
||
|
||
|
||
Correct determination of left members (UNICAST and others) | ||
---------------------------------------------------------- | ||
|
||
In UNICAST we have a connection hashtable which stores the sender's or | ||
receiver's address and an entry for the connection to/from that | ||
address. Whenever a view change is received, we remove members from | ||
that connection table that are not in the view. However, this | ||
introduces the following bug. | ||
|
||
Consider the case where we have 2 members in the group: {A, B}. Now C | ||
and D want to join simultaneously. Let's assume C is joined first, | ||
then D (JGroups doesn't join more than 1 member at a time). This | ||
will result in view {A, B, C} which is sent to A, B and C (not to D | ||
since D is not a member yet). Let's also assume that there is already | ||
an entry for D in A's connection table: if A is the coordinator, this | ||
entry is the result of D's JOIN request sent to A. Let's assume D's | ||
JOIN request has a seqno=1. | ||
|
||
When A receives the new view {A, B, C}, it will remove D's entry in | ||
its connection table BECAUSE IT THINKS THAT D HAS LEFT THE GROUP. Now | ||
D resends its JOIN request (with seqno=2). Since this unicast message | ||
is not tagged as first message, A will reject it. (Note that D does | ||
not retransmit the message with seqno=1 because A ack'ed it. Receivers | ||
ack all messages received even though the messages might not be | ||
accepted by their connection table). | ||
|
||
To solve this problem, we have to determine whether D really left the | ||
group, or whether it is just a new member not yet seen by the | ||
group. To do this, we maintain a 'group history' in a bounded cache | ||
which contains the last n members of the group. E.g. when C joins the | ||
cache is {A, B}. Before A removes D's entry from its connection table, | ||
it checks whether D actually was a member. Since D is not in {A, B}, | ||
its entry will not be removed. | ||
|
||
|
||
This has been modified (bela dec 23 2002): | ||
Now we do not remove an entry from the connection table in | ||
retransmit(), and only remove it in a VIEW_CHANGE when the diff | ||
between 2 view shows that a member left (see determineLeftMember in | ||
org.jgroups.util.Util). For example: first view is {A,B,C}, next | ||
view is {A,B}, then left member is C. | ||
The use case is when 2 or more members merge into a single group, | ||
e.g. A and B. Let's say A is the merge coord. At this point A needs to | ||
send a unicast message to B which is not member in A's view. Whenever | ||
we received a view change or retransmission request in A or B, each | ||
one would remove the other's connection entry from the connection | ||
table, therefore preventing the merge protocol from completing. | ||
|
||
|
||
|
||
Setting of recv/send buffer sizes in UDP (FRAG problem) | ||
------------------------------------------------------- | ||
|
||
There was a bug in FRAG which caused fragmentation of larger messages | ||
to receive only the first fragment of each message, and not further | ||
fragments. The reason was that the default send buffer size of UDP was | ||
8195: when a larger message (e.g. 30000 bytes) was to be sent, it was | ||
fragmented into 4 packets (3 * 8195 and 1 * 5415 bytes). All 4 | ||
fragments were then passed down to UDP almost simultaneously which | ||
caused the UDP send buffer to overflow which resulted in fragments 2, | ||
3 and 4 to be discarded. This explains why retransmission of fragments | ||
always correctly received the packets. Also, if there was a small | ||
timeout (e.g. 5ms) between the fragments passed down to UDP, *all* the | ||
fragments would be received. | ||
|
||
The solution is therefore to set the mcast_send_buf_size, | ||
mcast_recv_buf_size, ucast_send_buf_size and ucast_recv_buf_size in | ||
UDP explicitly, based on the frag_size value of FRAG. | ||
|
||
Note that the above problem did not cause messages to be dropped, but | ||
added a number of spurious retransmissions. | ||
|
||
|
||
|
||
Use of multiple sockets in UDP | ||
------------------------------ | ||
|
||
There are currently 3 sockets in UDP: | ||
|
||
- a MulticastSocket for sending and receiving multicast packets | ||
- a DatagramSocket for sending unicast packets and | ||
- a DatagramSocket for receiving unicast packets | ||
|
||
Normally there would only be 2 sockets: a socket for sending and | ||
receving unicast packets and a socket for sending and receiving | ||
multicast packets. The reason for the additional unicast send socket | ||
is that under Linux, when a packet is sent to a non-existent | ||
(e.g. crashed) receiver, there will be an asynchronous ICMP | ||
HOST-UNREACHABLE sent back to the sender. However, since it is | ||
asynchronous, it will not cause an exception in the current send(), | ||
but in *one of the following ones* ! Therefore, we need to close and | ||
re-open the send socket every time. | ||
|
||
This feature is due to the fact that the older JDK for Linux from | ||
Blackdown did *not* specify BSD compatibility when creating | ||
DatagramSockets, which meant that ICMP HOST-UNREACHABLE errors were | ||
actually progagated up to the interface (send()), and not just | ||
discarded, as would be expected. | ||
|
||
Newer JDKs (e.g. SUN's JDK 1.2.2 and 1.3 for Linux) don't seem to have | ||
this feature, e.g. they seem to have BSD_COMPAT set to true). An item | ||
on the todo list is to remove the code which closes and reopens the | ||
send socket (search for is_linux in UDP.java) and test. This way, we | ||
would have only 2 sockets and didn't need to close/reopen the socket | ||
on every unicast send. | ||
|
||
|
||
|
||
Why not merge all sockets into one, e.g. a MulticastSocket, for | ||
sending/receiving unicast as well as multicast packets ? | ||
-------------------------------------------------------- | ||
|
||
|
||
All MulticastSockets need to use the same IPMCAST address and port to | ||
be able to receive multicast packets. However, in JGroups, every | ||
member also needs to have its own address/port to receive unicast | ||
packets. In this case, this wouldn't be true, because the | ||
MulticastSocket only has one port, and that would be the IPMCAST | ||
port. Therefore, returning a unicast response to a multicast request | ||
would be impossible because senders cannot be discerned (they all have | ||
the same port). Example: sock=new MulticastSocket("myhost", 7500). In | ||
this case, sock.getLocalPort() returns 7500, which is true for *every | ||
member in the group*. | ||
Update (bela Sept 12 2006): even if we don't choose the mcast socket's port | ||
(new MulticastSocket()), when sending on multiple | ||
interfaces (send_on_all_interfaces="true"), we receive a different sender | ||
address for each message sent on a different NIC. This needs to be handled with | ||
*logical addresses* (see JIRA for 2.3). | ||
|
||
|
||
|
||
|
||
Sending of dest_addr and src_addr in Message | ||
-------------------------------------------- | ||
|
||
There was an idea of removing the src_addr and dest_addr from the | ||
serialized data exchanged between UDP protocols, which would reduce | ||
the size of packets. The idea was that the DatagramPacket itself would | ||
contain the sender's address (src_addr), and the receiver's address | ||
would be the one to which the packet was sent to. However, there are a | ||
number of problems. | ||
|
||
First, since we cannot have just one socket (see above), a receiver of | ||
a unicast and a multicast packet by the same sender S would see 2 | ||
different sender addresses (assuming for now, that we have 1 multicast | ||
and 1 unicast socket). Let's assume that the sender's port of the | ||
unicast socket is 5555 and its multicast port 7500. The receiver would | ||
receive a packet from S:5555 and one from S:7500, which are 2 | ||
different addresses although coming from the same sender. | ||
|
||
Second, a multicast sent to a number of receivers cannot be replied | ||
to: a receiver sees that the sender's port is 7500 and replies to that | ||
port. However, that port is used by the multicast group, therefore the | ||
sender of the multicast will not receive a unicast response. | ||
|
||
Third, a member always needs a unique <addr:port> IpAddress as local | ||
address. In the above case, we would always have 2 different | ||
IpAddresses, which causes problems identifying the member. Note that | ||
IpAddresses are used as keys into various structures, e.g. in | ||
NakReceiverWindow. | ||
|
||
|
||
|
||
|
||
UDP PacketHandler | ||
----------------- | ||
|
||
In the old design, unicast and mcast receiver threads first received | ||
the message, then unserialized it and finally passed it on to the next | ||
higher protocol. However, as unserialization takes a long time, the | ||
thread spends too much time in its unserialization routine, whereas it | ||
should be receiving messages. Therefore a lot of messages got dropped | ||
(network buffer overflow) and had to be retransmitted. | ||
|
||
The new design offers an alternative: the ucast and mcast threads just | ||
receive a DatagramPacket and add it to a queue served by the | ||
PacketHandler thread. Then they return to their task of receiving | ||
messages. The PacketHandler thread continuously dequeues messages, | ||
unserializes them and passes them on to the protocol above. | ||
|
||
The advantage of the PacketHandler is that large volumes of messages | ||
are better handled. The disadvantage is that we allocate a new byte | ||
buffer every time a message is received and add it to the queue. In | ||
the old design, there was one buffer for ucast and one for mcast | ||
messages. A receiver thread would receive a message into that buffer | ||
(65K), unserialize off of that same buffer and only after passing the | ||
message up the stack would it return to receiving a new message into | ||
the same buffer. So the advantage is not using up space (reuse of the | ||
same buffer). The new scheme copies the bytes received (e.g. 425 | ||
bytes) into its own separate byte buffer and adds that buffer to the | ||
packet handler's input queue, therefore using more memory, but being | ||
faster overall. | ||
|
||
To enable the PacketHandler set use_packet_handler to true in UDP. | ||
|
||
|
||
|
||
Storing copies of Messages vs references of Messages | ||
---------------------------------------------------- | ||
|
||
- Some protocols were changed May 26 2002 (by Bela) to store | ||
references instead of copies of messages in retransmit tables. This | ||
avoids accumulating large amounts of memory for copies that are not | ||
needed. This change was possible by switching from stack-based to | ||
hashmap-based headers in messages. When a message would have been | ||
stored in a retransmit table by reference rather than copy, each | ||
subsequent retransmission would have added headers to the | ||
stack. With the hashmap-based headers, headers for a given protocol | ||
are just overwritten by the header from that protocol. | ||
|
||
- The caveat is that the buffer (payload), or source or destination | ||
address should not be changed after the message is stored by | ||
reference in the xmit table, because the original message will be | ||
changed as well. | ||
|
||
|
||
|
||
|
||
GMS-less reliable message transmission | ||
-------------------------------------- | ||
|
||
Here are some more points to look at for GMS-less message transmission. | ||
|
||
Looks like such a stack could be FRAG:UNICAST:NAKACK:STABLE. Let me look | ||
at each protocol in turn. | ||
|
||
FRAG: | ||
- No changes; should work out of the box | ||
|
||
UNICAST: | ||
- The first message to a receiver is tagged with first=true. This allows | ||
the receiver to determine the starting seqno | ||
- When a member crashes we don't know about it. Therefore we need to | ||
periodically clean the connection table of unused connection entries. | ||
However, this could remove a connection to an idle (not crashed) sender. | ||
In this case, when it sends the next message, and we don't find an | ||
entry, we'll just return an error to the sender, so the seqno agreement | ||
process starts again. This way, UNICAST is reliable in that we don't | ||
lose any messages | ||
- When a receiver crashes, we don't know about it. Therefore, if the | ||
receiver hasn't acked all messages, we will continue retransmitting | ||
those messages until the end of time. To prevent this, we will close a | ||
connection after N unsuccessful retransmissions (without having received | ||
an ack). | ||
|
||
|
||
NAKACK: | ||
- The first seqno from a sender causes the creation of a | ||
NakReceiverWindow for that member (if it doesn't exist already) | ||
- When a sender crashes, and the receiver hasn't received all missing | ||
messages, the receiver will keep requesting retransmissions of the | ||
missing messages from the sender (forever). Therefore, similar to | ||
UNICAST, we purge senders from which we haven't received message | ||
retransmissions for N attempts (or a certain time). | ||
- Periodically we will purge idle connections. These may either be idle | ||
members, or crashed members, but we have no means of detection. When an | ||
idle member has purged, and sends the next message, a new entry will | ||
simply be created for it. | ||
|
||
STABLE: | ||
- A hard one: how do we know when a message is stable in all receivers | ||
so we can remove it from the retransmission table, when we don't have | ||
membership information ? The simplest scheme is probably just to | ||
periodically purge messages that are older than N seconds. If we | ||
accidentally purge a message that has not yet been seen by all members | ||
(small probability though), we just send an error back to the | ||
retransmitter. The requester could then for example remove the | ||
retransmitter's connection, so retransmission requests would stop. | ||
|
||
|
Oops, something went wrong.