Commit

find union
emanuele-em committed Nov 1, 2023
1 parent 9dd4ab9 commit f85d073
Showing 3 changed files with 285 additions and 1 deletion.
2 changes: 1 addition & 1 deletion src/SUMMARY.md
@@ -80,7 +80,7 @@
- [Binary trees](binary_trees.md)
- [Spanning trees](spanning_trees.md)
- [Kruskal’s algorithm](kruskal_s_algorithm.md)
<!-- - [Union-find structure](README.md) -->
- [Union-find structure](union_find_structure.md)
<!-- - [Prim’s algorithm](README.md) -->
<!-- - [Directed graphs](README.md) -->
<!-- - [Topological sorting](README.md) -->
105 changes: 105 additions & 0 deletions src/kruskal_s_algorithm.md
@@ -151,6 +151,111 @@ and there is a path between any two nodes.
The resulting graph is a minimum spanning tree
with weight $2+3+3+5+7=20$.

## Why does this work?

It is a good question why Kruskal's algorithm works.
Why does the greedy strategy guarantee that we
will find a minimum spanning tree?

Let us see what happens if the minimum weight edge of
the graph is _not_ included in the spanning tree.
For example, suppose that a spanning tree
for the previous graph does not contain the
minimum weight edge 5--6.
We do not know the exact structure of such a spanning tree,
but in any case it has to contain some edges.
Assume that the tree is as follows:

<script type="text/tikz">
\begin{tikzpicture}[scale=0.9]
\node[draw, circle] (1) at (1.5,2) {1};
\node[draw, circle] (2) at (3,3) {2};
\node[draw, circle] (3) at (5,3) {3};
\node[draw, circle] (4) at (6.5,2) {4};
\node[draw, circle] (5) at (3,1) {5};
\node[draw, circle] (6) at (5,1) {6};

\path[draw,thick,-,dashed] (1) -- (2);
\path[draw,thick,-,dashed] (2) -- (5);
\path[draw,thick,-,dashed] (2) -- (3);
\path[draw,thick,-,dashed] (3) -- (4);
\path[draw,thick,-,dashed] (4) -- (6);
\end{tikzpicture}
</script>

However, the above tree cannot be
a minimum spanning tree for the graph.
The reason is that adding the minimum weight edge 5--6
to the tree creates a cycle, and we can remove another edge
of that cycle (here the edge 2--3) to obtain a spanning tree again.
Since the removed edge is heavier than the edge 5--6,
the weight of the new spanning tree is _smaller_:

<script type="text/tikz">
\begin{tikzpicture}[scale=0.9]
\node[draw, circle] (1) at (1.5,2) {1};
\node[draw, circle] (2) at (3,3) {2};
\node[draw, circle] (3) at (5,3) {3};
\node[draw, circle] (4) at (6.5,2) {4};
\node[draw, circle] (5) at (3,1) {5};
\node[draw, circle] (6) at (5,1) {6};

\path[draw,thick,-,dashed] (1) -- (2);
\path[draw,thick,-,dashed] (2) -- (5);
\path[draw,thick,-,dashed] (3) -- (4);
\path[draw,thick,-,dashed] (4) -- (6);
\path[draw,thick,-] (5) -- node[font=\small,label=below:2] {} (6);
\end{tikzpicture}
</script>

For this reason, it is always optimal
to include the minimum weight edge
in the tree to produce a minimum spanning tree.
Using a similar argument, we can show that it
is also optimal to add the next edge in weight order
to the tree, and so on.
Hence, Kruskal's algorithm works correctly and
always produces a minimum spanning tree.

## Implementation

When implementing Kruskal's algorithm,
it is convenient to use
the edge list representation of the graph.
The first phase of the algorithm sorts the
edges in the list in $O(m \log m)$ time.
After this, the second phase of the algorithm
builds the minimum spanning tree as follows:

```rust, ignore
for ... {
if !a.equal_to(b) {a.join(b)}
}
```

The loop goes through the edges in the list
and always processes an edge $a$--$b$
where $a$ and $b$ are two nodes.
Two functions are needed:
the function `equal_to` determines
if $a$ and $b$ are in the same component,
and the function `join`
joins the components that contain $a$ and $b$.

The problem is how to efficiently implement
the functions `equal_to` and `join`.
One possibility is to implement the function
`equal_to` as a graph traversal and check if
we can get from node $a$ to node $b$.
However, the time complexity of such a function
would be $O(n+m)$
and the resulting algorithm would be slow,
because the function `equal_to` will be called for each edge in the graph.

We will solve the problem using a union-find structure
that implements both functions in $O(\log n)$ time
(there, the functions are called `same` and `unite`).
Thus, the time complexity of Kruskal's algorithm
will be $O(m \log n)$ after sorting the edge list.
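
The two phases can be put together as in the following minimal sketch. It is only one possible arrangement: the function names `kruskal` and `example_mst_weight`, the `(a, b, weight)` tuple layout and the 4-node example graph are assumptions made for illustration, and the component check and the join are written inline instead of through `equal_to` and `join`.

```rust
/// Follows the chain of links from `x` to the representative of its component.
fn find(mut x: usize, link: &[usize]) -> usize {
    while x != link[x] {
        x = link[x];
    }
    x
}

/// Returns the total weight of a minimum spanning tree of a connected graph
/// with nodes 1..=n, given as an edge list of (a, b, weight) tuples.
fn kruskal(n: usize, mut edges: Vec<(usize, usize, u64)>) -> u64 {
    // Phase 1: sort the edges by weight.
    edges.sort_by_key(|&(_, _, w)| w);

    // Every node starts in its own component of size 1 (index 0 is unused).
    let mut link: Vec<usize> = (0..=n).collect();
    let mut size = vec![1usize; n + 1];

    let mut total = 0;
    // Phase 2: add each edge that connects two different components.
    for (a, b, w) in edges {
        let (mut ra, mut rb) = (find(a, &link), find(b, &link));
        if ra != rb {
            // Attach the smaller component to the larger one.
            if size[ra] < size[rb] {
                std::mem::swap(&mut ra, &mut rb);
            }
            size[ra] += size[rb];
            link[rb] = ra;
            total += w;
        }
    }
    total
}

/// Hypothetical 4-node example graph (not the graph from the figures).
fn example_mst_weight() -> u64 {
    let edges = vec![(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 4), (2, 4, 5)];
    kruskal(4, edges)
}
```

For the example graph, the edges 1--2, 2--3 and 3--4 are chosen, while 1--3 and 2--4 are skipped because their endpoints are already connected.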

___

[^1] The algorithm was published in 1956 by J. B. Kruskal [48].
179 changes: 179 additions & 0 deletions src/union_find_structure.md
@@ -0,0 +1,179 @@
# Union-find structure

A **union-find structure** maintains
a collection of sets.
The sets are disjoint, so no element
belongs to more than one set.
Two $O(\log n)$ time operations are supported:
the `unite` operation joins two sets,
and the `find` operation finds the representative
of the set that contains a given element.

## Structure

In a union-find structure, one element in each set
is the representative of the set,
and there is a chain from any other element of the
set to the representative.
For example, assume that the sets are
$\{1,4,7\}$, $\{5\}$ and $\{2,3,6,8\}$:

<script type="text/tikz">
\begin{tikzpicture}
\node[draw, circle] (1) at (0,-1) {1};
\node[draw, circle] (2) at (7,0) {2};
\node[draw, circle] (3) at (7,-1.5) {3};
\node[draw, circle] (4) at (1,0) {4};
\node[draw, circle] (5) at (4,0) {5};
\node[draw, circle] (6) at (6,-2.5) {6};
\node[draw, circle] (7) at (2,-1) {7};
\node[draw, circle] (8) at (8,-2.5) {8};

\path[draw,thick,->] (1) -- (4);
\path[draw,thick,->] (7) -- (4);

\path[draw,thick,->] (3) -- (2);
\path[draw,thick,->] (6) -- (3);
\path[draw,thick,->] (8) -- (3);

\end{tikzpicture}
</script>

In this case the representatives
of the sets are 4, 5 and 2.
We can find the representative of any element
by following the chain that begins at the element.
For example, the element 2 is the representative
for the element 6, because
we follow the chain $6 \rightarrow 3 \rightarrow 2$.
Two elements belong to the same set exactly when
their representatives are the same.
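
As a small illustration, the chains in the figure can be stored in an array that holds, for each element, the next element in its chain (a representative refers to itself). The sketch below assumes this encoding; the helper `example_links` and the unused index 0 are choices made only for this example.

```rust
/// Follows the chain from `x` until it reaches an element that links to itself.
fn find(mut x: usize, link: &[usize]) -> usize {
    while x != link[x] {
        x = link[x];
    }
    x
}

/// Builds the links for the sets {1,4,7}, {5} and {2,3,6,8} drawn above.
/// Index 0 is unused padding so that the elements are 1-based.
fn example_links() -> Vec<usize> {
    // index: 0  1  2  3  4  5  6  7  8
    vec![0, 4, 2, 2, 4, 5, 3, 4, 3]
}
```

With these links, `find(6, &example_links())` follows $6 \rightarrow 3 \rightarrow 2$ and returns 2, and the elements 1 and 7 share the representative 4.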

Two sets can be joined by connecting the
representative of one set to the
representative of the other set.
For example, the sets
$\{1,4,7\}$ and $\{2,3,6,8\}$
can be joined as follows:

<script type="text/tikz">
\begin{tikzpicture}
\node[draw, circle] (1) at (2,-1) {1};
\node[draw, circle] (2) at (7,0) {2};
\node[draw, circle] (3) at (7,-1.5) {3};
\node[draw, circle] (4) at (3,0) {4};
\node[draw, circle] (6) at (6,-2.5) {6};
\node[draw, circle] (7) at (4,-1) {7};
\node[draw, circle] (8) at (8,-2.5) {8};

\path[draw,thick,->] (1) -- (4);
\path[draw,thick,->] (7) -- (4);

\path[draw,thick,->] (3) -- (2);
\path[draw,thick,->] (6) -- (3);
\path[draw,thick,->] (8) -- (3);

\path[draw,thick,->] (4) -- (2);
\end{tikzpicture}
</script>

The resulting set contains the elements
$\{1,2,3,4,6,7,8\}$.
From now on, the element 2 is the representative
for the entire set and the old representative 4
points to the element 2.

The efficiency of the union-find structure depends on
how the sets are joined.
It turns out that we can follow a simple strategy:
always connect the representative of the
_smaller_ set to the representative of the _larger_ set
(or if the sets are of equal size,
we can make an arbitrary choice).
Using this strategy, the length of any chain
will be $O(\log n)$, so we can
find the representative of any element
efficiently by following the corresponding chain.
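
The bound follows because the chain of an element grows by one link only when its set is attached to a set that is at least as large, so the size of its set at least doubles each time. The following sketch checks this on one merge pattern that always joins sets of equal size. The helper names `chain_length` and `longest_chain_after_pairwise_merges`, the 0-based elements and the merge order are assumptions made for this experiment.

```rust
/// Follows the chain from `x` to the representative of its set.
fn find(mut x: usize, link: &[usize]) -> usize {
    while x != link[x] {
        x = link[x];
    }
    x
}

/// Joins the sets of `a` and `b`, linking the representative of the
/// smaller set to the representative of the larger set.
fn unite(a: usize, b: usize, link: &mut [usize], size: &mut [usize]) {
    let (mut a, mut b) = (find(a, link), find(b, link));
    if a == b {
        return;
    }
    if size[a] < size[b] {
        std::mem::swap(&mut a, &mut b);
    }
    size[a] += size[b];
    link[b] = a;
}

/// Number of links followed from `x` to its representative.
fn chain_length(mut x: usize, link: &[usize]) -> usize {
    let mut steps = 0;
    while x != link[x] {
        x = link[x];
        steps += 1;
    }
    steps
}

/// Merges n = 2^k singleton sets pairwise in rounds, so that sets of equal
/// size are always joined, and returns the longest resulting chain.
fn longest_chain_after_pairwise_merges(k: u32) -> usize {
    let n = 1usize << k;
    let mut link: Vec<usize> = (0..n).collect();
    let mut size = vec![1usize; n];
    let mut step = 1;
    while step < n {
        // Join the block starting at i with the block starting at i + step.
        for i in (0..n).step_by(2 * step) {
            unite(i, i + step, &mut link, &mut size);
        }
        step *= 2;
    }
    (0..n).map(|x| chain_length(x, &link)).max().unwrap_or(0)
}
```

For $n = 2^k$ elements, this pattern produces a longest chain of exactly $k = \log_2 n$ links, so the bound cannot be improved for this joining strategy alone.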

## Implementation

The union-find structure can be implemented
using arrays.
In the following implementation,
the array `link` contains for each element
the next element
in the chain or the element itself if it is
a representative,
and the array `size` indicates for each representative
the size of the corresponding set.

Initially, each element belongs to a separate set:

```rust, ignore
for i in 1..=n {link[i] = i}
for i in 1..=n {size[i] = 1}
```

The function `find` returns
the representative for an element $x$.
The representative can be found by following
the chain that begins at $x$.

```rust
fn find(mut x: usize, link: &[usize]) -> usize {
    // Follow the chain until we reach an element that links to itself.
    while x != link[x] {x = link[x]}
    x
}
```

The function `same` checks
whether elements $a$ and $b$ belong to the same set.
This can easily be done by using the
function `find`:
```rust
# fn find(mut x:usize, link: &[usize]) -> usize {
# while x != link[x] {x = link[x]}
# x
# }
fn same(a: usize, b: usize, link: &[usize]) -> bool {
find(a, link) == find(b, link)
}
```

The function `unite` joins the sets
that contain elements $a$ and $b$
(the elements have to be in different sets).
The function first finds the representatives
of the sets and then connects the smaller
set to the larger set.

```rust
# fn find(mut x:usize, link: &[usize]) -> usize {
# while x != link[x] {x = link[x]}
# x
# }
fn unite(mut a: usize, mut b: usize, link: &mut [usize], size: &mut [usize]) {
    // Replace both elements by the representatives of their sets.
    a = find(a, link);
    b = find(b, link);
    // Make `a` the representative of the larger set.
    if size[a] < size[b] {std::mem::swap(&mut a, &mut b)}
    size[a] += size[b];
    link[b] = a;
}
```

The time complexity of the function `find`
is $O(\log n)$ assuming that the length of each
chain is $O(\log n)$.
In this case, the functions `same` and `unite`
also work in $O(\log n)$ time.
The function `unite` makes sure that the
length of each chain is $O(\log n)$ by connecting
the smaller set to the larger set.
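
The following sketch puts the three functions together on the sets used in the figures earlier in this chapter. It is meant as an illustration only: the helper `build_example`, the order of the `unite` calls and the early return for elements that are already in the same set are assumptions added here, and `std::mem::swap` is used to exchange the two representatives.

```rust
fn find(mut x: usize, link: &[usize]) -> usize {
    while x != link[x] {
        x = link[x];
    }
    x
}

fn same(a: usize, b: usize, link: &[usize]) -> bool {
    find(a, link) == find(b, link)
}

fn unite(a: usize, b: usize, link: &mut [usize], size: &mut [usize]) {
    let (mut a, mut b) = (find(a, link), find(b, link));
    // Added guard: joining a set with itself would corrupt its size.
    if a == b {
        return;
    }
    // Make `a` the representative of the larger set.
    if size[a] < size[b] {
        std::mem::swap(&mut a, &mut b);
    }
    size[a] += size[b];
    link[b] = a;
}

/// Builds the sets {1,4,7}, {5} and {2,3,6,8} from singletons.
/// Index 0 is unused so that the elements are 1-based.
fn build_example() -> (Vec<usize>, Vec<usize>) {
    let n: usize = 8;
    let mut link: Vec<usize> = (0..=n).collect();
    let mut size = vec![1usize; n + 1];
    for (a, b) in [(4, 1), (4, 7), (2, 3), (3, 6), (3, 8)] {
        unite(a, b, &mut link, &mut size);
    }
    (link, size)
}
```

After `build_example`, the representatives are 4, 5 and 2. Calling `unite(1, 6, ...)` then links the representative 4 of the smaller set to the representative 2 of the larger set, so 2 represents all seven elements. The individual chains may be shorter than in the figures, because `unite` always links one representative directly to the other.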

___

[^1] The structure presented here was introduced in 1971 by J. D. Hopcroft and J. D. Ullman [38].
Later, in 1975, R. E. Tarjan studied a more sophisticated variant of the structure [64] that is discussed in many algorithm textbooks nowadays.
