ns

belaban · May 20, 2006 · bb514f0 · bb514f0
1 parent da0a34a
commit bb514f0
Show file tree

Hide file tree

Showing 2 changed files with 761 additions and 0 deletions.
diff --git a/doc/design/Misc1.txt/DESIGN b/doc/design/Misc1.txt/DESIGN
@@ -0,0 +1,283 @@
+$Id: DESIGN,v 1.1 2006/05/20 21:42:13 belaban Exp $
+
+
+
+
+
+			    Design Issues
+			    =============
+
+
+
+
+
+Correct determination of left members (UNICAST and others)
+----------------------------------------------------------
+
+In UNICAST we have a connection hashtable which stores the sender's or
+receiver's address and an entry for the connection to/from that
+address. Whenever a view change is received, we remove members from
+that connection table that are not in the view. However, this
+introduces the following bug.
+
+Consider the case where we have 2 members in the group: {A, B}. Now C
+and D want to join simultaneously. Let's assume C is joined first,
+then D (JGroups doesn't join more than 1 member at a time). This
+will result in view {A, B, C} which is sent to A, B and C (not to D
+since D is not a member yet). Let's also assume that there is already
+an entry for D in A's connection table: if A is the coordinator, this
+entry is the result of D's JOIN request sent to A. Let's assume D's
+JOIN request has a seqno=1.
+
+When A receives the new view {A, B, C}, it will remove D's entry in
+its connection table BECAUSE IT THINKS THAT D HAS LEFT THE GROUP. Now
+D resends its JOIN request (with seqno=2). Since this unicast message
+is not tagged as first message, A will reject it. (Note that D does
+not retransmit the message with seqno=1 because A ack'ed it. Receivers
+ack all messages received even though the messages might not be
+accepted by their connection table).
+
+To solve this problem, we have to determine whether D really left the
+group, or whether it is just a new member not yet seen by the
+group. To do this, we maintain a 'group history' in a bounded cache
+which contains the last n members of the group. E.g. when C joins the
+cache is {A, B}. Before A removes D's entry from its connection table,
+it checks whether D actually was a member. Since D is not in {A, B},
+its entry will not be removed.
+
+
+This has been modified (bela dec 23 2002):
+Now we do not remove an entry from the connection table in
+retransmit(), and only remove it in a VIEW_CHANGE when the diff
+between 2 view shows that a member left (see determineLeftMember in
+org.jgroups.util.Util). For example: first view is {A,B,C}, next
+view is {A,B}, then left member is C.
+The use case is when 2 or more members merge into a single group,
+e.g. A and B. Let's say A is the merge coord. At this point A needs to
+send a unicast message to B which is not member in A's view. Whenever
+we received a view change or retransmission request in A or B, each
+one would remove the other's connection entry from the connection
+table, therefore preventing the merge protocol from completing.
+
+
+
+Setting of recv/send buffer sizes in UDP (FRAG problem)
+-------------------------------------------------------
+
+There was a bug in FRAG which caused fragmentation of larger messages
+to receive only the first fragment of each message, and not further
+fragments. The reason was that the default send buffer size of UDP was
+8195: when a larger message (e.g. 30000 bytes) was to be sent, it was
+fragmented into 4 packets (3 * 8195 and 1 * 5415 bytes). All 4
+fragments were then passed down to UDP almost simultaneously which
+caused the UDP send buffer to overflow which resulted in fragments 2,
+3 and 4 to be discarded. This explains why retransmission of fragments
+always correctly received the packets. Also, if there was a small
+timeout (e.g. 5ms) between the fragments passed down to UDP, *all* the
+fragments would be received.
+
+The solution is therefore to set the mcast_send_buf_size,
+mcast_recv_buf_size, ucast_send_buf_size and ucast_recv_buf_size in
+UDP explicitly, based on the frag_size value of FRAG.
+
+Note that the above problem did not cause messages to be dropped, but
+added a number of spurious retransmissions.
+
+
+
+Use of multiple sockets in UDP
+------------------------------
+
+There are currently 3 sockets in UDP:
+
+      - a MulticastSocket for sending and receiving  multicast packets
+      - a DatagramSocket for sending unicast packets and
+      - a DatagramSocket for receiving unicast packets
+
+Normally there would only be 2 sockets: a socket for sending and
+receving unicast packets and a socket for sending and receiving
+multicast packets. The reason for the additional unicast send socket
+is that under Linux, when a packet is sent to a non-existent
+(e.g. crashed) receiver, there will be an asynchronous ICMP
+HOST-UNREACHABLE sent back to the sender. However, since it is
+asynchronous, it will not cause an exception in the current send(),
+but in *one of the following ones* ! Therefore, we need to close and
+re-open the send socket every time.
+
+This feature is due to the fact that the older JDK for Linux from
+Blackdown did *not* specify BSD compatibility when creating
+DatagramSockets, which meant that ICMP HOST-UNREACHABLE errors were
+actually progagated up to the interface (send()), and not just
+discarded, as would be expected.
+
+Newer JDKs (e.g. SUN's JDK 1.2.2 and 1.3 for Linux) don't seem to have
+this feature, e.g. they seem to have BSD_COMPAT set to true). An item
+on the todo list is to remove the code which closes and reopens the
+send socket (search for is_linux in UDP.java) and test. This way, we
+would have only 2 sockets and didn't need to close/reopen the socket
+on every unicast send.
+
+
+
+Why not merge all sockets into one, e.g. a MulticastSocket, for
+sending/receiving unicast as well as multicast packets ?
+--------------------------------------------------------
+
+
+All MulticastSockets need to use the same IPMCAST address and port to
+be able to receive multicast packets. However, in JGroups, every
+member also needs to have its own address/port to receive unicast
+packets. In this case, this wouldn't be true, because the
+MulticastSocket only has one port, and that would be the IPMCAST
+port. Therefore, returning a unicast response to a multicast request
+would be impossible because senders cannot be discerned (they all have
+the same port). Example: sock=new MulticastSocket("myhost", 7500). In
+this case, sock.getLocalPort() returns 7500, which is true for *every
+member in the group*.
+Update (bela Sept 12 2006): even if we don't choose the mcast socket's port
+(new MulticastSocket()), when sending on multiple
+interfaces (send_on_all_interfaces="true"), we receive a different sender
+address for each message sent on a different NIC. This needs to be handled with
+*logical addresses* (see JIRA for 2.3).
+
+
+
+
+Sending of dest_addr and src_addr in Message
+--------------------------------------------
+
+There was an idea of removing the src_addr and dest_addr from the
+serialized data exchanged between UDP protocols, which would reduce
+the size of packets. The idea was that the DatagramPacket itself would
+contain the sender's address (src_addr), and the receiver's address
+would be the one to which the packet was sent to. However, there are a
+number of problems.
+
+First, since we cannot have just one socket (see above), a receiver of
+a unicast and a multicast packet by the same sender S would see 2
+different sender addresses (assuming for now, that we have 1 multicast
+and 1 unicast socket). Let's assume that the sender's port of the
+unicast socket is 5555 and its multicast port 7500. The receiver would
+receive a packet from S:5555 and one from S:7500, which are 2
+different addresses although coming from the same sender.
+
+Second, a multicast sent to a number of receivers cannot be replied
+to: a receiver sees that the sender's port is 7500 and replies to that
+port. However, that port is used by the multicast group, therefore the
+sender of the multicast will not receive a unicast response.
+
+Third, a member always needs a unique <addr:port> IpAddress as local
+address. In the above case, we would always have 2 different
+IpAddresses, which causes problems identifying the member. Note that
+IpAddresses are used as keys into various structures, e.g. in
+NakReceiverWindow.
+
+
+
+
+UDP PacketHandler
+-----------------
+
+In the old design, unicast and mcast receiver threads first received
+the message, then unserialized it and finally passed it on to the next
+higher protocol. However, as unserialization takes a long time, the
+thread spends too much time in its unserialization routine, whereas it
+should be receiving messages. Therefore a lot of messages got dropped
+(network buffer overflow) and had to be retransmitted.
+
+The new design offers an alternative: the ucast and mcast threads just
+receive a DatagramPacket and add it to a queue served by the
+PacketHandler thread. Then they return to their task of receiving
+messages. The PacketHandler thread continuously dequeues messages,
+unserializes them and passes them on to the protocol above.
+
+The advantage of the PacketHandler is that large volumes of messages
+are better handled. The disadvantage is that we allocate a new byte
+buffer every time a message is received and add it to the queue. In
+the old design, there was one buffer for ucast and one for mcast
+messages. A receiver thread would receive a message into that buffer
+(65K), unserialize off of that same buffer and only after passing the
+message up the stack would it return to receiving a new message into
+the same buffer. So the advantage is not using up space (reuse of the
+same buffer). The new scheme copies the bytes received (e.g. 425
+bytes) into its own separate byte buffer and adds that buffer to the
+packet handler's input queue, therefore using more memory, but being
+faster overall.
+
+To enable the PacketHandler set use_packet_handler to true in UDP.
+
+
+
+Storing copies of Messages vs references of Messages
+----------------------------------------------------
+
+- Some protocols were changed May 26 2002 (by Bela) to store
+  references instead of copies of messages in retransmit tables. This
+  avoids accumulating large amounts of memory for copies that are not
+  needed. This change was possible by switching from stack-based to
+  hashmap-based headers in messages. When a message would have been
+  stored in a retransmit table by reference rather than copy, each
+  subsequent retransmission would have added headers to the
+  stack. With the hashmap-based headers, headers for a given protocol
+  are just overwritten by the header from that protocol.
+
+- The caveat is that the buffer (payload), or source or destination
+  address should not be changed after the message is stored by
+  reference in the xmit table, because the original message will be
+  changed as well.
+
+
+
+
+GMS-less reliable message transmission
+--------------------------------------
+
+Here are some more points to look at for GMS-less message transmission.
+
+Looks like such a stack could be FRAG:UNICAST:NAKACK:STABLE. Let me look
+at each protocol in turn.
+
+FRAG:
+- No changes; should work out of the box
+
+UNICAST:
+- The first message to a receiver is tagged with first=true. This allows
+  the receiver to determine the starting seqno
+- When a member crashes we don't know about it. Therefore we need to
+  periodically clean the connection table of unused connection entries.
+  However, this could remove a connection to an idle (not crashed) sender.
+  In this case, when it sends the next message, and we don't find an
+  entry, we'll just return an error to the sender, so the seqno agreement
+  process starts again. This way, UNICAST is reliable in that we don't
+  lose any messages
+- When a receiver crashes, we don't know about it. Therefore, if the
+  receiver hasn't acked all messages, we will continue retransmitting
+  those messages until the end of time. To prevent this, we will close a
+  connection after N unsuccessful retransmissions (without having received
+  an ack).
+
+
+NAKACK:
+- The first seqno from a sender causes the creation of a
+  NakReceiverWindow for that member (if it doesn't exist already)
+- When a sender crashes, and the receiver hasn't received all missing
+  messages, the receiver will keep requesting retransmissions of the
+  missing messages from the sender (forever). Therefore, similar to
+  UNICAST, we purge senders from which we haven't received message
+  retransmissions for N attempts (or a certain time).
+- Periodically we will purge idle connections. These may either be idle
+  members, or crashed members, but we have no means of detection. When an
+  idle member has purged, and sends the next message, a new entry will
+  simply be created for it.
+
+STABLE:
+- A hard one: how do we know when a message is stable in all receivers
+  so we can remove it from the retransmission table, when we don't have
+  membership information ? The simplest scheme is probably just to
+  periodically purge messages that are older than N seconds. If we
+  accidentally purge a message that has not yet been seen by all members
+  (small probability though), we just send an error back to the
+  retransmitter. The requester could then for example remove the
+  retransmitter's connection, so retransmission requests would stop.
+
+