# Kruskal's MST

Overview:
 - Competitive with Prim's in both theory and practice; a classic "greatest hit" algorithm
 - New data structure: Union-Find. Kruskal's runs in O(mlogn) with it
 - Clustering and such

MST Review:
 - Input undirected graph G, each edge with cost Ce
 - Output - Min-Cost spanning tree (no cycles, completely connected)
 - Assume G is connected with distinct edge costs (though ties don't break the algorithm)
 - Cut Property - if e is the cheapest edge crossing some cut (A,B), then e belongs to the MST

Kruskal's Idea:
 - Repeatedly take the cheapest remaining edge in the graph and add it to the collection
     - Skip the edge if it creates a cycle with the edges already chosen
     - Chosen edges can be disjoint; no need to keep the subgraph connected at each individual step

KruskalMST(Graph):
 - sort edges in order of increasing edge cost (rename edges 1,2,3,4...m so that C1 < C2 < ... < Cm)
 - T = empty set (edges, tree in progress)
 - for i = 1 to m:
     - if T U {i} has no cycles
         - add i to T
 - return T
 - Really, can run as a while loop and stop early once T has n - 1 edges (at that point T already spans every vertex).

### Correctness

Theorem: Kruskal's is correct

Same structure as the proof for Prim's: first prove the output is a spanning tree, then prove it is the minimum-cost one

Proof: Let T* = output of Kruskal's algorithm on input graph G
 - Clearly, T* has no cycles based on pseudocode
 - T* is connected, why?
     - By the Empty Cut Lemma, only need to show that T* contains at least one edge crossing every cut.
     - Fix a cut (A,B). Since input G is assumed connected, G contains at least one edge that crosses this cut.
         - Key Idea: Kruskal's considers each edge exactly once. When Kruskal's first considers an edge crossing (A,B), this edge will definitely be included in T*.
         - At that point no chosen edge crosses (A,B), so by the Lonely Edge Corollary this first edge cannot create a cycle. Kruskal's only skips edges that create cycles, so this edge is guaranteed to be chosen. Thus, at least one edge of Kruskal's output crosses this particular cut (A,B). Because (A,B) is arbitrary, every cut has some edge of T* crossing it.
     - Thus, Kruskal's outputs a spanning tree
 - Every edge of T* justified by the Cut Property. This means T* is the MST
     - Consider the iteration where edge (u,v) is added to the current set T. At this intermediate point, T can consist of multiple separate connected components and isolated vertices
     - Since T U {(u,v)} has no cycle, T has no current path between u and v, so they are in different pieces.
         - Thus, there exists an empty cut (A,B) separating u and v (as in the Empty Cut Lemma)
         - I.e., can find A and B with u on one side, v on the other side, and no edge of T crossing between the two
         - The added edge (u,v) not only crosses (A,B), it is also the cheapest edge of G crossing it. No edge considered earlier crosses (A,B): the first such edge would have been added to T without creating a cycle (as in the connectivity argument), contradicting that the cut is empty. Since Kruskal's considers lowest-cost edges first, every other crossing edge costs more than (u,v).
         - Thus, (u,v) is justified by the Cut Property
     

### Implementing  via Union-Find

KruskalMST(Graph):
 - sort edges in order of increasing edge cost (rename edges 1,2,3,4...m so that C1 < C2 < ... < Cm)
 - T = empty set (edges, tree in progress)
 - for i = 1 to m:
     - if T U {i} has no cycles
         - add i to T
 - return T
 - Really, can run as a while loop and stop early once T has n - 1 edges (at that point T already spans every vertex).
 
Straightforward Implementation:
 - First sort edges in order of increasing edge cost, O(mlogn)
     - logm and logn are interchangeable inside big O: m <= n^2, so log(m) <= log(n^2) = 2logn = O(logn)
 - For-loop: O(m) iterations
     - if statement: O(n) time to check for a cycle. Checks whether a u-v path already exists in T for edge (u,v): start a graph search at u and see if it reaches v or not. Takes time linear in the size of T, and T has at most n - 1 edges.
         - Note, cannot just check whether both endpoints are already spanned, as with the set X in Prim's. Remember, Kruskal's doesn't maintain a single connected component.
 - So, overall, O(mn) since the for-loop dominates.
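
The straightforward implementation above can be sketched in Python roughly as follows. This is only an illustrative sketch: the names `kruskal_naive` and `has_path`, and the convention that edges are `(cost, u, v)` tuples over vertices `0..n-1`, are assumptions made for the example, not part of the notes.

```python
from collections import defaultdict


def has_path(adj, source, target):
    """Iterative DFS over the edges currently in T.

    T has at most n - 1 edges, so this is O(n) per call.
    """
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def kruskal_naive(num_vertices, edges):
    """O(mn) Kruskal's. `edges` is a list of (cost, u, v) tuples (assumed format)."""
    adj = defaultdict(list)  # adjacency lists of T only, not of G
    tree = []
    for cost, u, v in sorted(edges):  # increasing edge cost
        if not has_path(adj, u, v):  # no u-v path in T, so no cycle is created
            tree.append((cost, u, v))
            adj[u].append(v)
            adj[v].append(u)
            if len(tree) == num_vertices - 1:  # early stop: T spans the graph
                break
    return tree
```

For instance, on the 4-vertex graph with edges `(1,0,1), (2,1,2), (3,0,2), (4,2,3), (5,1,3)`, the edge of cost 3 is skipped because 0 and 2 are already connected through vertex 1.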
 
Union-Find allows checking for a cycle in constant time. The cycle checks then cost only O(m) in total (constant time per loop iteration), so sorting dominates the running time. Runtime drops to O(mlogn).

**Union-Find**
 - Fairly primitive version here and not incredibly extensive discussion. Better implementations -> better runtime.
 - Maintains a partition of a set of objects (disjoint subsets C1, C2, C3, C4 whose union is the entire set).
 - Operations:
     - Find(x) - return name of the group that x belongs to
     - Union(Ci, Cj) - fuse groups Ci, Cj into a single one.
 - For Kruskal's
     - At the beginning, each vertex is its own component.
     - When an edge is added to T, it effectively combines two components.
     - Objects in the data structure = vertices
     - Groups = connected components formed by the edges currently in T
     - Adding a new edge (u,v) to T means fusing the connected components of u and v
     
Basics (motivation: O(1)-time cycle checks in Kruskal's):
 - Idea 1 - maintain one linked structure per connected component of (V,T)
     - Each vertex of the graph has one extra pointer
     - Each component has an arbitrary vertex that's the "leader"
 - Invariant - each vertex points to the leader of its connected component (basically a component ID)
     - Given 2 vertices u,v, check if they have the same leader. If same leader, adding (u,v) will create a cycle
     - Comparing these two leaders is constant time.
     - So the cycle check for edge (u,v) is O(1): adding (u,v) creates a cycle iff Find(u) == Find(v)
 - Maintaining Invariant:
     - When a new edge (u,v) is added, it fuses two connected components. Need to update leader pointers
     - In the worst case, need to update O(n) pointers. E.g., fusing 2 sets, each of size n/2
 - Idea 2 - when two components merge, the new union inherits the leader of one of the two components
     - Keep the leader of the larger of the two components (so fewer pointers get rewired); the smaller component's vertices inherit it
     - Can augment Union-Find to maintain a size field for each group, so the populations of two groups can be compared in constant time
     - Even with this, a single merge can still need O(n) pointer updates in the worst case.
     - Updating Leader Pointers:
         - With a vertex-centric view, how many times does a single vertex have its leader pointer updated over the course of Kruskal's? At most log2(n) times, i.e., O(logn).
             - Say a vertex is in a component C of size 20. Its leader pointer is only updated when C is the smaller (or equal) side of a merge, so the other component has size >= 20 and the new union is at least 2x the original size. Component size never exceeds n, so this doubling can only happen <= log2(n) times.
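
The leader-pointer scheme with "smaller group inherits the larger group's leader" might look like the sketch below. Assumptions for illustration only: the class name `UnionFind`, the `members` dictionary used to find the vertices to rewire, the `updates` counter (added purely to observe the log2(n) bound), and the choice that `union` takes two objects rather than two group names.

```python
class UnionFind:
    """Primitive Union-Find: every object stores a pointer to its group's leader."""

    def __init__(self, objects):
        self.leader = {x: x for x in objects}     # each object starts as its own leader
        self.members = {x: [x] for x in objects}  # leader -> objects in its group
        self.updates = {x: 0 for x in objects}    # leader-pointer updates per object

    def find(self, x):
        # O(1): the invariant says x points directly at its leader.
        return self.leader[x]

    def union(self, x, y):
        """Fuse the groups containing x and y. Returns False if already fused."""
        big, small = self.find(x), self.find(y)
        if big == small:
            return False
        # Compare group sizes in O(1) and keep the leader of the larger group.
        if len(self.members[big]) < len(self.members[small]):
            big, small = small, big
        # Only the smaller group's pointers are rewired.
        for member in self.members[small]:
            self.leader[member] = big
            self.updates[member] += 1
        self.members[big].extend(self.members.pop(small))
        return True
```

Each time an object's pointer is rewired, its group at least doubles in size, which is why no entry of `updates` can exceed log2(n).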

Running Time Analysis:
 - O(mlogn) for sorting
 - O(m) time for cycle checks (O(1) per iteration)
 - O(nlogn) time overall for leader pointer updates (n vertices, each updated O(logn) times across all merges)
 - O(mlogn) for sorting dominates, since m >= n - 1 in a connected graph. Overall, O(mlogn)
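
Putting the pieces together, a Union-Find-based Kruskal's could be sketched as below, with comments marking where each term of the running time comes from. The function name `kruskal_union_find`, the `(cost, u, v)` edge format, and the inlined leader/members lists are assumptions for this sketch rather than a definitive implementation.

```python
def kruskal_union_find(num_vertices, edges):
    """O(m log n) Kruskal's. `edges` is a list of (cost, u, v) tuples (assumed format)."""
    leader = list(range(num_vertices))            # leader[v] = leader of v's component
    members = [[v] for v in range(num_vertices)]  # members[l] = vertices led by l
    tree = []
    for cost, u, v in sorted(edges):              # sorting: O(m log n)
        lu, lv = leader[u], leader[v]
        if lu == lv:                              # O(1) cycle check
            continue
        if len(members[lu]) < len(members[lv]):   # keep the larger group's leader
            lu, lv = lv, lu
        for w in members[lv]:                     # O(n log n) in total over the run
            leader[w] = lu
        members[lu].extend(members[lv])
        members[lv] = []
        tree.append((cost, u, v))
        if len(tree) == num_vertices - 1:         # early stop: T spans the graph
            break
    return tree
```

On the example edge list `[(3, 0, 2), (1, 0, 1), (5, 1, 3), (2, 1, 2), (4, 2, 3)]` with 4 vertices, this selects the edges of cost 1, 2 and 4.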

### MSTs: State-of-the-Art and Open Questions

Question: Better than O(mlogn) running time for MSTs?

Answer: Yes. 
 - O(m) randomized algorithm. [Karger-Klein-Tarjan JACM 1995]
 - Not known whether there is a deterministic algorithm that runs in linear, O(m), time.
 - O(m alpha(n)) deterministic algorithm exists; alpha(n) is the "Inverse Ackermann Function"
     - Really close to linear time. Very slow-growing.
     - Grows even slower than log* n = # of times you can apply log to n until the result drops to at most 1 (the inverse of the "tower function" 2^2^2^...^2).
 - And more, not finished
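
To get a feel for how slowly log* grows, here is a small sketch of the iterated logarithm. It assumes base-2 logs and the common convention of stopping once the value is at most 1; the function name `log_star` is made up for the example.

```python
import math


def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

Under this convention `log_star(65536)` is 4 (65536 -> 16 -> 4 -> 2 -> 1), and the value only reaches 5 for inputs as large as 2^65536.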