# Disjoint Set
* Organizes objects into non-overlapping sets.
* Supports testing whether two objects are in the same set and merging sets together.
* Stores a collection of (dynamic) sets of elements that are disjoint from each other.
* Each element belongs to only one set.

* **Data**:
    * A collection of non-empty disjoint sets $S = \{S_1, S_2, \ldots, S_k\}$ where each set is identified by a unique element called its representative

* **Operations**:
    * **Make-Set(x)**: Given element x that does not already belong to one of the sets in the collection, create a new set {x} that contains only x. Assign x as the representative of that new set. 
    * **Find-Set(x)**: Given element x, return the representative of the set that contains x (or NIL if x does not belong to any set).
    * **Union(x, y)**: Given two distinct elements x and y, let $S_x$ be the set that contains x and $S_y$ be the set that contains y. Form a new set consisting of $S_x \cup S_y$ and remove $S_x$ and $S_y$ from the collection. Pick a representative for the new set. As a pre-condition, x and y must each be an element of some set in the collection.
* **Important**:
    * **Disjoint** 
        * means they do not have any common elements
        * no element can occur in more than one of the sets
    * **Non-empty**
        * at least one element in the set
* **Applications**:
    * KRUSKAL-MST
    * Finding connected components of a graph
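
To make the three operations concrete, here is a deliberately naive reference model of the interface, followed by the connected-components application. It is only a sketch: the class name `NaiveDisjointSets`, the dictionary layout, and the helper `connected_components` are illustrative assumptions, and efficiency is ignored on purpose.

```python
class NaiveDisjointSets:
    """Specification-level model: one Python set per representative (illustrative only)."""

    def __init__(self):
        self.sets = {}  # representative -> set of elements it represents

    def make_set(self, x):
        # Pre-condition: x does not already belong to any set.
        assert self.find_set(x) is None
        self.sets[x] = {x}

    def find_set(self, x):
        for rep, members in self.sets.items():
            if x in members:
                return rep
        return None  # plays the role of NIL

    def union(self, x, y):
        rx, ry = self.find_set(x), self.find_set(y)
        # Pre-condition: both elements belong to some set.
        assert rx is not None and ry is not None
        if rx != ry:
            # Remove S_y from the collection and merge it into S_x;
            # rx is kept as the representative of the merged set.
            self.sets[rx] |= self.sets.pop(ry)


def connected_components(vertices, edges):
    """One set per vertex, then one Union per edge."""
    ds = NaiveDisjointSets()
    for v in vertices:
        ds.make_set(v)
    for u, v in edges:
        ds.union(u, v)
    return sorted(sorted(s) for s in ds.sets.values())


components = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

Each remaining set corresponds to one connected component of the graph.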
        

## General assumptions:
* The disjoint set structure tracks the sets, not the elements.
* Each element x holds a reference to the location (node) of x in the disjoint set data structure.

### 7 ways of implementation:
1. Circular Linked Lists $\Theta(m^2)$
2. Linked List with Extra .rep Pointer $\Theta(m^2)$
3. Linked List with union-by-weight $\Theta(m \log m)$
4. Inverted Trees $\Theta(m^2)$
5. Inverted Trees with union-by-rank $\Theta(m \log m)$
6. Inverted Trees with path compression $\Theta(m \log m)$
7. Inverted Trees with union-by-rank and path compression $O(m \log^* n)$

## 1. Circular Linked Lists
* Each set: one circular linked list 
* Head of the linked list also serves as the representative
* each list circular: last element connected to first element
* each node has a boolean flag `rep`: true if the node is the representative, false otherwise

* **Operation**:
    * MakeSet(x): just a new linked list with a single element x
        * Running Time: O(1)
    * FindSet(x): follow the links until reaching the head
        * Running Time: $\theta$(n)
    * Union(x, y)
        1. Locate the head of each linked list by calling FindSet, which takes $\Theta(L)$, where L is the length of the list
        2. Exchange the two heads' next pointers, which takes $\Theta(1)$
        3. Keep only one representative for the new set
* **Worst Case Runtime**:
    * FindSet is the time consuming operation
    * Consider the total cost of a sequence of m operations
    * Upper bound: 
        * each operation works on a data structure where each list contains $\leq$ m elements
        * so each operation takes O(m) to traverse a list
        * total: $O(m^2)$
    * Lower bound:
        * consider a particular sequence of m operations
        * $\frac{m}{3}$ MakeSets &rarr; $\Theta(m)$ in total
        * $\frac{m}{3} - 1$ Unions &rarr; $\Theta(m)$ in total
        * $\frac{m}{3} + 1$ FindSets, each traversing a list of up to $\frac{m}{3}$ elements &rarr; $\Theta((\frac{m}{3})^2)$ in total
        * so the whole sequence costs $\Theta((\frac{m}{3})^2) = \Theta(m^2)$
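
The circular-list operations can be sketched as follows. This is a minimal sketch under assumptions: the `Node` class and its field names (`value`, `next`, `rep`) are illustrative, and the choice of which head keeps the `rep` flag after a union is arbitrary.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = self   # a singleton list is already circular
        self.rep = True    # the only node is the head / representative


def make_set(value):
    return Node(value)


def find_set(node):
    # Follow the links until the flagged head is reached.
    while not node.rep:
        node = node.next
    return node


def union(x, y):
    hx, hy = find_set(x), find_set(y)
    if hx is hy:
        return hx
    # Exchanging the two heads' next pointers splices the two cycles into one.
    hx.next, hy.next = hy.next, hx.next
    hy.rep = False  # keep only one representative
    return hx


nodes = [make_set(i) for i in range(5)]
union(nodes[0], nodes[1])
union(nodes[2], nodes[3])
union(nodes[1], nodes[3])
```

After these calls, elements 0 to 3 share one circular list headed by `nodes[0]`, while `nodes[4]` is still a singleton.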

## 2. Linked List with extra pointer (to head)
* **Operation**:
    * MakeSet(x): takes O(1)
    * FindSet(x): takes O(1), since we can go to head in 1 step, better than circular linked list
    * Union(x, y):
        * Idea: append one list to the other, then update the pointers to head
        * Append takes O(1) time
        * Updating the pointers takes O(L), where L is the length of the appended list
* **Worst Case Runtime**:
    * MakeSet and FindSet are fast; Union now becomes the time-consuming operation, especially when appending a long list
    * Consider the total cost of a sequence of m operations
    * Upper Bound:
        * $O(m^2)$
    * Lower Bound:
        * $\Omega(m^2)$
    * Total cost: $\Theta(1 + 2 + 3 + \dots + (\frac{m}{2} - 1)) = \Theta(m^2)$
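
The sum in the total cost can be evaluated in closed form. The derivation below assumes a sequence of $\frac{m}{2}$ MakeSets followed by $\frac{m}{2} - 1$ Unions in which the $i$-th Union appends a list of length $i$ (always the longer list); that sequence is an assumption chosen to match the sum, not something the notes spell out.

$$
\sum_{i=1}^{\frac{m}{2}-1} i \;=\; \frac{\left(\frac{m}{2}-1\right)\cdot\frac{m}{2}}{2} \;=\; \frac{m^2}{8} - \frac{m}{4} \;\in\; \Theta(m^2)
$$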
    
    

## 3. Linked List with union-by-weight
* **Operation**:
    * <span style="background-color: yellow">append the shorter one to the longer one</span>
    * need to keep track of the size (weight) of each list, hence the name union-by-weight
* **Worst Case Runtime**:
    * For any sequence of m operations of which n are MakeSets (so there are n elements in total), the total cost is $O(m + n \log n)$
* **Proof**:
    * Consider an arbitrary element x. How many times does its head pointer need to be updated?
    * Because of union-by-weight, whenever x's pointer is updated, x must be in the smaller of the two lists. In other words, after the union, the list containing x is at least twice as large as before
    * That is, every time x is updated, the size of its set at least doubles
    * There are only n elements in total, so the size can double at most $\log n$ times, i.e. x can be updated at most $O(\log n)$ times
    * The same holds for all n elements, so the total number of updates is $O(n \log n)$
    * Everything else costs O(1) per operation, which adds O(m) and gives $O(m + n \log n)$ overall
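
The counting argument can be checked on a small instance. The sketch below assumes illustrative field names (`head` for the extra pointer, plus `tail` and `size` stored on head nodes) and a global `updates` counter that exists only for the experiment; it merges 16 singletons pairwise so that every union joins two lists of equal length.

```python
import math


class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.head = self   # the extra pointer to the representative
        # The two fields below are only meaningful on a head node.
        self.tail = self
        self.size = 1


updates = 0  # counts head-pointer rewrites (instrumentation only)


def find_set(node):
    return node.head


def union(x, y):
    global updates
    a, b = find_set(x), find_set(y)
    if a is b:
        return a
    if a.size < b.size:
        a, b = b, a  # make a the longer list, b the shorter one
    # Append the shorter list to the longer one in O(1).
    a.tail.next = b
    a.tail = b.tail
    a.size += b.size
    # Only the nodes of the shorter list get their head pointer rewritten.
    node = b
    while node is not None:
        node.head = a
        updates += 1
        node = node.next
    return a


n = 16
nodes = [Node(i) for i in range(n)]
step = 1
while step < n:
    for i in range(0, n, 2 * step):
        union(nodes[i], nodes[i + step])
    step *= 2
```

In this run there are $\log_2 16 = 4$ rounds and each round rewrites 8 pointers, so `updates` ends at 32, within the $n \log n = 64$ bound.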

## 4. Trees
* Each set is an "inverted" tree
    * each element keeps a pointer to its parent in the tree
    * the root points to itself (test root by x.p = x)
    * the representative is the root
    * not necessarily a binary tree or balanced tree
* **Operations**:
    * MakeSet(x): create a single-node tree with root x
        * Running Time: O(1)
    * FindSet(x): Trace up the parent pointer until the root is reached
        * Running Time: O(height of tree)
    * Union(x, y)
        1. Call FindSet(x) and FindSet(y) to locate the representatives, O(h)
        2. Let one tree's root point to the other tree's root, O(1)
    * Benchmarking: runtime
        * The worst-case sequence of m operations (with FindSet being the bottleneck):
        * $\frac{m}{4}$ MakeSets, $\frac{m}{4} - 1$ Unions, $\frac{m}{2} + 1$ FindSets
        * Each FindSet can take up to $\frac{m}{4}$ steps
        * The total cost of the worst-case sequence: $\Theta(m^2)$
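
A minimal sketch of the plain inverted-tree version, using a dictionary `parent` in place of per-node `p` pointers (an assumption made for brevity; the `depth` helper is also illustrative). The union order below is chosen to produce the degenerate chain behind the worst case.

```python
parent = {}


def make_set(x):
    parent[x] = x  # the root points to itself


def find_set(x):
    # Trace parent pointers until the root is reached.
    while parent[x] != x:
        x = parent[x]
    return x


def union(x, y):
    rx, ry = find_set(x), find_set(y)
    if rx != ry:
        parent[rx] = ry  # no balancing: x's root always goes under y's root


def depth(x):
    """Number of parent links between x and its root."""
    steps = 0
    while parent[x] != x:
        x = parent[x]
        steps += 1
    return steps


n = 6
for v in range(n):
    make_set(v)
for v in range(n - 1):
    union(v, v + 1)  # builds the chain 0 -> 1 -> 2 -> ... -> n-1
```

Every union here is cheap because its first argument is still a root, yet the result is a chain of height n - 1, so each later `find_set(0)` walks the whole chain.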

## 5. Trees with union-by-rank
* **Intuition**:
    * FindSet takes O(h), so the height of tree matters
    * To keep the unioned tree's height small, we should let the taller tree's root be the root of the unioned tree
1. For now, a node's rank is the same as its height; the two will differ once path compression is introduced
2. On union, let the root with the lower rank point to the root with the higher rank
3. If the two roots have the same rank, choose either root as the new root and increment its rank
* **Benchmarking: runtime**
    * It can be proven that a tree of n nodes formed by union-by-rank has height at most $\log n$, which means FindSet takes $O(\log n)$
    * For a sequence of $\frac{m}{4}$ MakeSets, $\frac{m}{4} - 1$ Unions, $\frac{m}{2} + 1$ FindSets, the total cost is $O(m \log m)$
    * The rank of a tree with n nodes is at most $\log n$, i.e. $r(n) \leq \log n$
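* **Proof**
    * Let n(r) be the minimum number of nodes in a tree whose root has rank r. Claim, by induction on r: $n(r) \geq 2^r$
    * Base Step: if r = 0 (single node), n(0) = 1 $= 2^0$, TRUE
    * Inductive Step: assume $n(r) \geq 2^r$
        * a root only reaches rank r + 1 by unioning two trees whose roots both have rank r, so
        * $n(r+1) \geq n(r) + n(r) \geq 2 \times 2^r = 2^{r+1}$
    * Taking logarithms: a tree with n nodes and root rank r has $n \geq 2^r$, so $r \leq \log n$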
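
The invariant $n(r) \geq 2^r$ can be checked empirically. In this sketch the dictionaries `parent`, `rank`, and `size` are illustrative names, and `size` is bookkeeping added only to test the invariant; the union decision itself uses rank alone. The random workload and its seed are arbitrary assumptions.

```python
import math
import random

parent, rank, size = {}, {}, {}


def make_set(x):
    parent[x] = x
    rank[x] = 0
    size[x] = 1


def find_set(x):
    while parent[x] != x:
        x = parent[x]
    return x


def union(x, y):
    rx, ry = find_set(x), find_set(y)
    if rx == ry:
        return rx
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx  # rx is now the root with the higher (or equal) rank
    parent[ry] = rx      # lower-rank root points to higher-rank root
    size[rx] += size[ry]
    if rank[rx] == rank[ry]:
        rank[rx] += 1    # equal ranks: the new root's rank grows by one
    return rx


def depth(x):
    steps = 0
    while parent[x] != x:
        x = parent[x]
        steps += 1
    return steps


random.seed(0)
n = 64
for v in range(n):
    make_set(v)
for _ in range(200):
    union(random.randrange(n), random.randrange(n))

roots = {find_set(v) for v in range(n)}
invariant_holds = all(size[r] >= 2 ** rank[r] for r in roots)
deepest = max(depth(v) for v in range(n))
```

Since no node can sit deeper than its root's rank, `deepest` stays at or below $\log_2 64 = 6$ regardless of the order of the unions.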

## 6. Trees with path compression
* **Compress**:
    * During FindSet(x), make every node on the path from x to the root point directly to the root
    * Extra cost to FindSet: at most twice the cost, so it does not affect the order of complexity
* **Runtime**:
    * For a sequence of operations with n MakeSets (so at most n-1 Unions) and k FindSets, the worst-case total cost of the sequence is in $\Theta\left(n + k \times \left(1 + \log_{2+\frac{k}{n}} n\right)\right)$
    * For a sequence of $\frac{m}{4}$ MakeSets, $\frac{m}{4} - 1$ Unions, $\frac{m}{2} + 1$ FindSets, the worst-case total cost is in $\Theta(m \log m)$
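
One way to sketch path compression is a two-pass FindSet: the first pass locates the root, the second walks the same path again and repoints every node on it. That second walk is where the "at most twice the cost" comes from. The `parent` dictionary and the hand-built chain are illustrative assumptions.

```python
parent = {}


def make_set(x):
    parent[x] = x


def find_set(x):
    # Pass 1: locate the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Pass 2: walk the same path again and point every node at the root.
    while parent[x] != root:
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


# Build the chain 0 -> 1 -> 2 -> 3 -> 4 by hand (4 is the root).
for v in range(5):
    make_set(v)
for v in range(4):
    parent[v] = v + 1

root_of_zero = find_set(0)
```

After the single call `find_set(0)`, nodes 0 through 3 all point directly at the root 4, so any later FindSet on them takes one step.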
    

## 7. Trees with union by rank and path compression
* **Intuition**:
    * path compression happens in the FindSet operation
    * union by rank happens in the union operation (outside FindSet)
* **Runtime**:
    * For a sequence of m operations with n MakeSets (so at most n-1 Unions), the worst-case total cost of the sequence is $O(m \log^* n)$
    * $\log^* n$ is the number of times the log function must be iteratively applied to n until the result is at most 1
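
Putting both heuristics together gives the sketch below, along with a small helper that evaluates $\log^* n$ with base-2 logarithms. The class name `DisjointSet`, the dictionary-based storage, and the recursive form of `find_set` are illustrative choices, not the only way to write it.

```python
import math


def log_star(n):
    """Number of times log2 must be applied to n until the result is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


class DisjointSet:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x
        self.rank[x] = 0

    def find_set(self, x):
        # Path compression happens here, inside FindSet.
        if self.parent[x] != x:
            self.parent[x] = self.find_set(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        # Union by rank happens here, outside FindSet.
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return rx
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return rx


ds = DisjointSet()
for v in range(8):
    ds.make_set(v)
ds.union(0, 1)
ds.union(2, 3)
ds.union(1, 3)
ds.union(4, 5)
```

Because $\log^*$ grows so slowly (for instance `log_star(65536)` is 4), the $O(m \log^* n)$ bound is close to linear in m for any practical input size.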