# The Minimum Spanning Tree Problem

**Objectives**

- Introduce students to the graph theoretic concept of spanning
  trees.
- Show three different combinatorial algorithms for solving the
  minimum spanning tree problem.
- Demonstrate a practical use of minimum spanning trees.

**Reading:** Read Handout 4 on the minimum spanning tree problem.

**Brief description:** In this lab, we review some of the
applications of the minimum spanning tree (MST) problem, along with the
concept of a spanning tree in an undirected graph (and why spanning trees are
the desired solutions for the problem), some algorithms for solving
the MST problem, and sensitivity analysis for
this problem.

<font color='blue'> <b>Solutions are shown in blue.</b> </font> <br>
<font color='red'> <b>Instructor comments are shown in red.</b> </font>

In [None]:
# Imports -- don't forget to run this cell!
import networkx as nx
import vinal as vl
import pandas as pd
import pickle
from IPython.display import Image
from bokeh.io import output_notebook, show
output_notebook()

## Part I: An Application: Communication Network Design

You are the engineer in charge of designing a new high-speed fiber optic Internet network between several Operations Research departments throughout the U.S. Your objective is to design a system that connects the various campuses. However, so that this network can be brought online quickly, we must install the fiber optic line within existing physical infrastructure. The possible physical cable routes between cities and the cost of installing the fiber optic cable along each route (in millions of dollars) are given in two CSV files.

In [None]:
# Load the data and create a graph G
nodes = pd.read_csv('data/fiber_optic_nodes.csv', index_col=0)
edges = pd.read_csv('data/fiber_optic_edges.csv', index_col=0)
G = vl.create_network(nodes, edges)

# Plot the graph G
show(vl.tree_plot(G, tree=[], width=600))

How do you suppose you would go about designing such a system? Since you can only use the edges shown in the graph, you must choose a subgraph of the given graph, or in other words, a subset of the possible edges. Every location must be serviced which means that the subgraph must be spanning. You should be able to get to any location from any other location. This means the subgraph should be connected. Because you are trying to minimize cost, the subgraph should also be minimal, meaning that you cannot remove any of the edges while maintaining the other necessary properties. A minimal connected spanning subgraph is called a spanning tree. There are many other ways of defining trees. In operations research terminology, we want to find a minimum spanning tree of the given graph.
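
The three properties above (uses only available edges, spanning, connected and minimal) can be checked mechanically. The following is a minimal sketch in plain Python, not part of the lab software: the function name `is_spanning_tree` and the 4-node toy graph are made up for illustration. It relies on the fact that a connected subgraph on n nodes with exactly n - 1 edges is a tree.

```python
def is_spanning_tree(nodes, graph_edges, tree_edges):
    """Return True if tree_edges form a spanning tree of the graph."""
    allowed = {frozenset(e) for e in graph_edges}
    # Every chosen edge must be one of the available edges.
    if any(frozenset(e) not in allowed for e in tree_edges):
        return False
    # A spanning tree on n nodes has exactly n - 1 edges.
    if len(tree_edges) != len(nodes) - 1:
        return False
    # Check connectivity with a depth-first search over the chosen edges.
    neighbors = {v: [] for v in nodes}
    for u, v in tree_edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in neighbors[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


# Toy 4-node graph (hypothetical, not the lab data).
toy_nodes = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

print(is_spanning_tree(toy_nodes, toy_edges, [(0, 1), (1, 2), (2, 3)]))  # True
print(is_spanning_tree(toy_nodes, toy_edges, [(0, 1), (1, 2), (0, 2)]))  # False: node 3 is not reached
```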

In [None]:
tree = [(0,14),(0,1),(1,2),(2,3),(3,5),(4,5),(11,5),(5,6),
        (6,15),(9,10),(7,9),(7,8),(6,7),(14,12),(12,13)]
show(vl.tree_plot(G, tree=tree, show_cost=True, width=900, height=600))

**Q:** An example of a spanning tree is indicated above using thick edges. Do you think that this is the best possible? Can you briefly give a convincing argument why or why not?

**A:** <font color='blue'> No, you could use edge (3,4) instead of (3,5) and have a cheaper spanning tree. </font>

## Part II: Minimum Spanning Tree Algorithms

In this section we will investigate three different algorithms for solving the Minimum Spanning Tree Problem. First you will try the algorithm that you have seen in lecture: Prim’s algorithm. This algorithm works as
follows:

1. Choose any node from which to begin, say node 1, and start the tree with the cheapest edge from node 1 to one of the other nodes in the graph.
2. At each subsequent step, add the cheapest edge that maintains
connectivity of the current tree and adds a new node. In other words,
add the cheapest edge that connects a new node to those already
connected.
3.  Continue to do this until all nodes are connected.
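
The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used by the lab software: the function name `prim_mst`, the adjacency-dictionary format, and the toy graph are assumptions. A heap holds the candidate edges that leave the set of connected nodes, so the cheapest one can be taken at each step.

```python
import heapq


def prim_mst(adjacency, start):
    """Return the tree edges, in the order added, as (u, v, cost) tuples."""
    visited = {start}
    # Heap of candidate edges (cost, u, v) leaving the visited set.
    frontier = [(cost, start, v) for v, cost in adjacency[start].items()]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(adjacency):
        cost, u, v = heapq.heappop(frontier)
        if v in visited:
            continue  # both endpoints are already connected; skip this edge
        visited.add(v)
        tree.append((u, v, cost))
        for w, c in adjacency[v].items():
            if w not in visited:
                heapq.heappush(frontier, (c, v, w))
    return tree


# Toy graph (hypothetical, not the lab data): node -> {neighbor: cost}.
toy = {
    'a': {'b': 4, 'c': 1},
    'b': {'a': 4, 'c': 2, 'd': 5},
    'c': {'a': 1, 'b': 2, 'd': 8},
    'd': {'b': 5, 'c': 8},
}
tree = prim_mst(toy, 'a')
print(tree)                        # [('a', 'c', 1), ('c', 'b', 2), ('b', 'd', 5)]
print(sum(c for _, _, c in tree))  # 8
```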

In [None]:
show(vl.tree_plot(G, tree=[], width=600))

**Q:** Run the first 5 iterations of this algorithm (starting at node 0 on the far left). Show your work: indicate the order in which you add the edges.

**A:** <font color='blue'> Add edges in this order: (0,1), (1,2), (2,3), (3,4), (4,5).</font>

Now, use the software to run the first 5 iterations of Prim's. Run the cell to output a plot. The starting node is solid dark blue. Select the edges you identified in **Q2** to add them to the minimum spanning tree.

Note: To select an edge, click on it, and then keep your cursor on the edge till the newly connected node turns dark blue.

In [None]:
show(vl.assisted_mst_algorithm_plot(G, algorithm='prims', s=0, width=900, height=600))

**Q:** If you made a mistake, an error message would have appeared. Did you make any mistakes?

**A:** <font color='blue'> Answers will vary.</font>

**Q:** There are two different colors of nodes in the above graph. Using your observations about the first 5 iterations, explain what each represents.

**A:** <font color='blue'>Solid dark blue nodes have been visited (they are in the current tree); the other nodes are unvisited.</font>

**Q:** At each iteration of Prim's, among which subset of edges are you selecting the next edge to add? How do you select which one of these to add?

**A:** <font color='blue'>The edges that go from a visited node to an unvisited node. You select the cheapest one.</font>

**Q:** Run the cell below and click next until you have a minimum spanning tree. Try to anticipate each of the algorithm’s steps. What is the total cost of this tree, and how does this compare with the original tree? (Hint: the cost of the tree is shown in the bottom left.)

**A:** <font color='blue'>The total cost is 103 compared to the cost of 137 for the original tree.</font>

In [None]:
show(vl.mst_algorithm_plot(G, algorithm='prims', i=0, width=900, height=600))

Now we turn to another algorithm, which is called Kruskal's
Algorithm. This is an example of a so-called Greedy Algorithm,
i.e. an algorithm that always takes the step that "looks best"
currently. Greedy algorithms are widely used in computer science
beyond just solving the MST problem. Kruskal's algorithm works as follows:

1. Begin with the cheapest edge (break ties arbitrarily).
2. At each step, add the cheapest edge not already in the system that does not create a cycle, or a loop in the system.
3. Continue adding edges until you get a spanning tree.
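
The steps above can be sketched as follows. This is an illustrative sketch in plain Python, not the lab software's implementation: the function name `kruskal_mst`, the `(u, v, cost)` edge format, and the toy graph are assumptions. The cycle test uses a union-find structure, in which two nodes are already connected exactly when they have the same representative. Edges that were considered but rejected are returned as well.

```python
def kruskal_mst(nodes, edges):
    """edges: list of (u, v, cost). Returns (accepted, rejected) edge lists."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the component representative,
        # shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    accepted, rejected = [], []
    for u, v, cost in sorted(edges, key=lambda e: e[2]):
        if len(accepted) == len(nodes) - 1:
            break  # a spanning tree has n - 1 edges
        ru, rv = find(u), find(v)
        if ru == rv:
            rejected.append((u, v, cost))  # would close a cycle
        else:
            parent[ru] = rv
            accepted.append((u, v, cost))
    return accepted, rejected


# Toy graph (hypothetical, not the lab data).
toy_nodes = ['a', 'b', 'c', 'd']
toy_edges = [('a', 'b', 4), ('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5), ('c', 'd', 8)]
accepted, rejected = kruskal_mst(toy_nodes, toy_edges)
print(accepted)  # [('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5)]
print(rejected)  # [('a', 'b', 4)]
```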

In [None]:
show(vl.tree_plot(G, tree=[], width=600))

**Q:** Run the first 5 iterations of this algorithm. Show your work: indicate the order in which you add the edges and also indicate (differently) any edges that you considered adding (but decided not to because they created a cycle or loop).

**A:** <font color='blue'> Add edges in this order: (7,8), (9,8), **(9,7)**, (7,6), (15,6), **(15,9)**, (11,10) where bold edges were considered but not added.</font>

Run the cell below.

In [None]:
show(vl.assisted_mst_algorithm_plot(G, algorithm='kruskals', width=900, height=600))

**Q:** Verify your by-hand computations were correct. Did you make any mistakes?

**A:** <font color='blue'> Answers will vary.</font>

**Q:** Run the cell below and click next until you have a minimum spanning tree. Again, try to anticipate each of the algorithm’s steps. What is the cost of the final solution?

**A:** <font color='blue'> 103</font>

In [None]:
show(vl.mst_algorithm_plot(G, algorithm='kruskals', width=900, height=600))

**Q:** Is the same spanning tree found by the two algorithms?

**A:** <font color='blue'>No.</font>

**Q:** (Why is the previous question not a dumb question?)

**A:** <font color='blue'>Both are minimum spanning trees, but a graph can have multiple minimum spanning trees, all with the same cost.</font>

The following algorithm is called the Reverse Greedy Algorithm, and it
effectively does Kruskal's in reverse.

1. Start with the entire graph.
2. At each step, check if the graph has a cycle. If it does, remove the most expensive edge in the cycle (break ties arbitrarily, and pick any cycle you'd like).
3. Continue to do this until the graph remaining is a spanning tree.
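
One common variant of this idea can be sketched in plain Python: scan the edges from most expensive to cheapest and delete an edge whenever the graph stays connected without it (such an edge lies on a cycle). This is an illustrative sketch under the assumption that the input graph is connected; the function names, the `(u, v, cost)` edge format, and the toy graph are made up and are not taken from the lab software.

```python
def is_connected(nodes, edges):
    """True if every node can be reached from the first node using `edges`."""
    neighbors = {v: [] for v in nodes}
    for u, v, _ in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    start = nodes[0]
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in neighbors[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def reverse_delete_mst(nodes, edges):
    """Delete expensive edges while the graph stays connected."""
    remaining = list(edges)
    for edge in sorted(edges, key=lambda e: e[2], reverse=True):
        if len(remaining) == len(nodes) - 1:
            break  # what is left is a spanning tree
        trial = [e for e in remaining if e != edge]
        if is_connected(nodes, trial):
            remaining = trial  # the edge was on a cycle, so it can go
    return remaining


# Toy graph (hypothetical, not the lab data).
toy_nodes = ['a', 'b', 'c', 'd']
toy_edges = [('a', 'b', 4), ('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5), ('c', 'd', 8)]
print(reverse_delete_mst(toy_nodes, toy_edges))
# [('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5)]
```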


In [None]:
show(vl.tree_plot(G, tree=[], width=600))

**Q:** Run the first 5 iterations of this algorithm. Show your work: indicate the order in which you delete the edges.

**A:** <font color='blue'> Delete edges in this order: (2,11), (0,14), (2,14),(1,14),(11,3) </font>

Run the cell below and click next until you have a minimum spanning tree. Note: this software works a little bit differently than the process above: it always looks for the most expensive edge it can eliminate. It runs a version of Reverse Kruskal's, but not in the full generality described above.

In [None]:
show(vl.mst_algorithm_plot(G, algorithm='reverse_kruskals', width=900, height=600))

**Q:** What is the cost of the resulting spanning tree?

**A:** <font color='blue'> 103 </font>

**Q:** How does the spanning tree and its cost compare to those obtained by the previous algorithms?

**A:** <font color='blue'> They all found spanning trees of the same cost. </font>

**Q:** Which of the three algorithms was the easiest for you to follow? Why?

**A:** <font color='blue'> Answers will vary. </font>

## Part III: Analyzing Minimum Spanning Trees

Suppose you start with some spanning tree, like the first one given in your lab handout. Can you devise a way to systematically improve it? In other words, given a spanning tree, can you tell if it is one of minimum cost and, if not, can you improve it (without recomputing a minimum spanning tree from scratch)?

**Q:** Suggest such an algorithm.

**A:** <font color='blue'> Take an edge of the spanning tree. Remove it. You now have two connected sets of nodes. Look for a cheaper edge spanning these two sets. If there is one, add it back instead of the original edge. If you cannot do this for any edge, you have a minimum spanning tree.</font>
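
The edge-swap idea in this answer can be sketched in plain Python. This is one possible reading of it, not a reference solution: the function names, the `(u, v, cost)` edge format, and the toy graph are assumptions. Each swap strictly lowers the total cost, so the loop must stop, and it stops only when no tree edge can be replaced by a cheaper edge across the same two sets.

```python
def component_of(start, edges):
    """Nodes reachable from `start` using the given (u, v, cost) edges."""
    neighbors = {}
    for u, v, _ in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in neighbors.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def improve_tree(edges, tree):
    """Swap tree edges for cheaper crossing edges until no swap helps."""
    tree = list(tree)
    improved = True
    while improved:
        improved = False
        for old in tree:
            rest = [e for e in tree if e != old]
            side = component_of(old[0], rest)
            # Edges with exactly one endpoint in `side` reconnect the two pieces.
            crossing = [e for e in edges if (e[0] in side) != (e[1] in side)]
            best = min(crossing, key=lambda e: e[2])
            if best[2] < old[2]:
                tree = rest + [best]
                improved = True
                break  # restart the scan with the updated tree
    return tree


# Toy graph and a deliberately poor starting tree (hypothetical data).
toy_edges = [('a', 'b', 4), ('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5), ('c', 'd', 8)]
start_tree = [('a', 'b', 4), ('b', 'd', 5), ('c', 'd', 8)]  # cost 17
better = improve_tree(toy_edges, start_tree)
print(better)                        # [('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5)]
print(sum(c for _, _, c in better))  # 8
```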

Next we will study how the solution changes when problem parameters are altered. This is referred to as Sensitivity Analysis. Consider edge {3, 5} which was not used in the minimum spanning tree found by Prim’s algorithm. Just this one edge’s cost will be changed.

In [None]:
show(vl.tree_plot(G, tree=[(3,5)], width=600))

**Q:** Should its cost increase or decrease for it to be included in the new minimum spanning tree? Exactly what must the cost of {3, 5} be changed to for this to occur?

**A:** <font color='blue'> It must *decrease* to 14.</font>

We can use the command below to get the weight of any edge:

In [None]:
G[2][3]['weight']

**Q:** Change the weight of {3,5} to be 1 less than the cost you answered in **Q17.**

In [None]:
# TODO: Change the weight of edge {3,5}
# G[3][5]['weight'] = ?

### BEGIN SOLUTION
G[3][5]['weight'] = 13
### END SOLUTION

Run the cell below to re-run Prim's algorithm.

In [None]:
mst = vl.prims(G,i=0)
show(vl.tree_plot(G, tree=mst, i=0, width=900, height=600))

**Q:** Is the edge {3,5} now in the tree?

**A:** <font color='blue'> Yes.</font>

**Q:** Change the weight of {3,5} to be 2 more than it currently is.

In [None]:
# TODO: Change the weight of edge {3,5}
# G[3][5]['weight'] = ?

### BEGIN SOLUTION
G[3][5]['weight'] = 15
### END SOLUTION

In [None]:
mst = vl.prims(G,i=0)
show(vl.tree_plot(G, tree=mst, i=0, width=900, height=600))

**Q:** Is the edge {3,5} no longer in the tree?

**A:** <font color='blue'> Yes.</font>

**Q:** In general, what does this suggest about how the cost of any one "not included" edge must be changed if it is to be included in the minimum spanning tree for the modified data?

**A:** <font color='blue'> Consider the cycle created by adding the "not included" edge to the minimum spanning tree. The cost of the "not included" edge must drop to at most the cost of the most expensive other edge in this cycle.</font>
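
The threshold for a "not included" edge can be computed directly: it is the cost of the most expensive tree edge on the tree path between the edge's two endpoints, since that path together with the new edge forms the cycle. The sketch below is illustrative only; the function name `tree_path_max`, the `(u, v, cost)` edge format, and the toy tree are assumptions.

```python
def tree_path_max(tree, source, target):
    """Most expensive edge cost on the tree path from source to target."""
    neighbors = {}
    for u, v, cost in tree:
        neighbors.setdefault(u, []).append((v, cost))
        neighbors.setdefault(v, []).append((u, cost))
    # Depth-first search that carries the largest edge cost seen on the path so far.
    stack = [(source, None, float('-inf'))]
    while stack:
        node, came_from, largest = stack.pop()
        if node == target:
            return largest
        for nxt, cost in neighbors.get(node, []):
            if nxt != came_from:
                stack.append((nxt, node, max(largest, cost)))
    raise ValueError("target is not reachable in the tree")


# Toy minimum spanning tree (hypothetical, not the lab data).
mst = [('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5)]

# A non-tree edge {c, d} can enter a minimum spanning tree once its cost is at most:
print(tree_path_max(mst, 'c', 'd'))  # 5
# A non-tree edge {a, b} can enter once its cost is at most:
print(tree_path_max(mst, 'a', 'b'))  # 2
```

At exactly the threshold the two edges tie, so either may appear in a minimum spanning tree; strictly below it, the formerly excluded edge must be used.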

**Q:** Now consider an edge that is in the minimum spanning tree, such as {1, 2}. How must this be changed for this edge to be forced out of the optimal solution? Again, also figure out the general rule for forcing minimum spanning tree edges out of the minimum spanning tree.

**A:** <font color='blue'> Its cost must increase to be 19 or greater. In general, remove the edge and consider all the other edges that span the two disconnected subsets of nodes. The cost of the edge must increase to at least the cost of the cheapest of these edges for it to be forced out of the MST.</font>
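
The general rule for tree edges can be sketched the same way: remove the tree edge, find the two sets of nodes it separated, and look up the cheapest other edge between them. The code below is an illustrative sketch; the function name `cheapest_replacement`, the `(u, v, cost)` edge format, and the toy data are assumptions.

```python
def cheapest_replacement(edges, tree, tree_edge):
    """Cheapest other edge that reconnects the tree after deleting tree_edge."""
    rest = [e for e in tree if e != tree_edge]
    neighbors = {}
    for u, v, _ in rest:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    # Collect the nodes still attached to one endpoint of the removed edge.
    side, stack = {tree_edge[0]}, [tree_edge[0]]
    while stack:
        u = stack.pop()
        for v in neighbors.get(u, []):
            if v not in side:
                side.add(v)
                stack.append(v)
    # Keep the other edges that have exactly one endpoint on that side.
    crossing = [e for e in edges
                if e != tree_edge and (e[0] in side) != (e[1] in side)]
    return min(crossing, key=lambda e: e[2]) if crossing else None


# Toy graph and its minimum spanning tree (hypothetical, not the lab data).
toy_edges = [('a', 'b', 4), ('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5), ('c', 'd', 8)]
mst = [('a', 'c', 1), ('b', 'c', 2), ('b', 'd', 5)]

print(cheapest_replacement(toy_edges, mst, ('b', 'd', 5)))  # ('c', 'd', 8)
print(cheapest_replacement(toy_edges, mst, ('a', 'c', 1)))  # ('a', 'b', 4)
```

In the toy graph, edge {b, d} stays in some minimum spanning tree until its cost exceeds 8, and edge {a, c} until its cost exceeds 4.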

## Part IV: MST Application to Clustering

We can also use the Minimum Spanning Tree algorithms to find clusters in data. A cluster is a subset of the data such that the data points in the subset are similar to each other, and different from data points not in the cluster. Clustering is one of the most important applications of modern computing. It is, for example, a vital technique for machine learning. Suppose you are in the marketing department of a food delivery firm, and you want to tailor your advertisements so you can appeal to different customers in relevant ways. You can use clustering methods to divide your customer base into groups based on some key characteristics, and then you can choose advertisements that are best for each group. Then, for any given customer, the algorithm can identify which group the customer falls in, and show them the corresponding advertisements. This is a major application of clustering, known as market segmentation. We can look at a very simple example to illustrate this process.
Suppose we have data on how much our customers have spent on food delivery, as well as the number of different purchases they have made. This data can be represented on a scatter plot, as seen below with sample data.

In [None]:
Image("images/scatter.png", width=1000)

Each point represents a customer. The x-axis represents the number of orders placed in a year, and the y-axis represents the average amount of money spent per order. This data has three separate clusters. For example, there is a cluster of data points towards the left edge of the graph. These data points all have very low x values, and so represent customers who rarely order out.

**Q:** What groups of customers could the other two clusters represent?

**A:** <font color='blue'> Answers should be along these lines: customers who order out a lot but do not spend much per order, and customers who order out frequently and spend a lot per order. </font>

Finding patterns in their customers' buying habits can be useful to companies. For example, you may want to focus your advertising differently for these different types of customers. For larger data sets in many dimensions, it may be impossible to visualize the data at all. In that case, these clusters can also be identified algorithmically, using the same techniques we used to find MSTs. We will use Kruskal's algorithm.

In every step of Kruskal's algorithm we add the shortest edge possible. Since the edge is chosen so that it does not form a cycle, each added edge connects two previously unconnected components of the graph. After the last step the whole graph is connected, so just before the second-to-last step the graph is divided into three components. Because each added edge joined the parts that were closest to each other, these three components are also going to be the clusters. Below, the points are shown as a graph, with the approximate distances between some pairs of nodes shown as weighted edges. A computer would run this algorithm on the graph with all nodes connected to each other, but for better visualization only a subset of the edges is shown. The MST on the graph is shown by the highlighted edges.

In [None]:
# Load the data and create a graph G
nodes = pd.read_csv('data/spending.csv', index_col=0)
edges = pd.read_csv('data/spending_edges.csv', index_col=0)
G = vl.create_network(nodes, edges)

# Generate the MST
mst = vl.kruskals(G)
# Plot the graph G
show(vl.tree_plot(G, tree = mst, width=900, height=600))

**Q:** What would have been the last two edges Kruskal's algorithm would have added?

**A:** <font color='blue'> The edge between index 3 and 14, and the edge between index 22 and 27. </font>
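
The stopping rule described above (halt Kruskal's algorithm once three components remain) can be sketched in plain Python on the complete graph of pairwise distances. This is an illustrative sketch, not the lab software: the function name `kruskal_clusters` and the toy points are made up, and Euclidean distance is an assumed choice of edge weight. It uses `math.dist`, which requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def kruskal_clusters(points, k):
    """Merge the closest components, in Kruskal's order, until k remain."""
    parent = list(range(len(points)))

    def find(i):
        # Representative of the component containing point i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs of points, shortest distance first (the complete graph's edges).
    pairs = sorted(combinations(range(len(points)), 2),
                   key=lambda p: dist(points[p[0]], points[p[1]]))
    components = len(points)
    for i, j in pairs:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            components -= 1
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy (orders per year, average spend) data, made up for illustration.
points = [(1, 10), (2, 12), (1, 11), (20, 5), (21, 6), (22, 5), (20, 30), (21, 31)]
print(kruskal_clusters(points, 3))  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```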

Run the cell below, and run Kruskal's algorithm on the graph until you get the above MST. Verify that the last 2 edges added were as you answered. Then, take two steps back to before those two edges were added. The data is divided into the three clusters that we could see.

In [None]:
show(vl.mst_algorithm_plot(G, algorithm='kruskals', width=900, height=600))

The data we have is not usually as cleanly divisible into clusters as this, however. For example, below, some data is illustrated, which has two easily discernible clusters. However, you can see that some of the data points below are not really similar to any of the other points, and so cannot really be said to be a part of either cluster. These points are outliers.

In [None]:
# Load the data and create a graph G
nodes = pd.read_csv('data/cluster.csv', index_col=0)
edges = pd.read_csv('data/cluster_edges.csv', index_col=0)
G = vl.create_network(nodes, edges)

# Plot the graph G
show(vl.tour_plot(G, tour=[], width = 900, height = 600))

**Q:** Can you identify which nodes in the above data are outliers?

**A:** <font color='blue'> Nodes 18 and 19 </font>

Suppose we remove the outliers from the dataset. We can then use the methods we used to find the MST to identify the clusters in the data. Run the algorithm below, and after it finishes, go back one step; each of the two clusters will then be a connected component.

In [None]:
# Load the data and create a graph G
nodes = pd.read_csv('data/cluster2.csv', index_col=0)
edges = pd.read_csv('data/cluster2_edges.csv', index_col=0)
G = vl.create_network(nodes, edges)

# Step through Kruskal's algorithm on G
show(vl.mst_algorithm_plot(G, algorithm='kruskals', width=500, height=600))

As we saw, Kruskal's algorithm can be used to separate the graph into the two components that are farthest from each other. When the data contains an outlier, the component farthest from the rest is going to be that outlier, so we can use the same algorithm to identify outliers automatically. Follow the same process as above: when you get to the end and go back one step, you will see that all the other nodes are connected to each other, and there is one outlier.

In [None]:
# Load the data and create a graph G
nodes = pd.read_csv('data/cluster.csv', index_col=0)
edges = pd.read_csv('data/cluster_edges.csv', index_col=0)
G = vl.create_network(nodes, edges)

# Step through Kruskal's algorithm on G
show(vl.mst_algorithm_plot(G, algorithm='kruskals', width=900, height=600))

So, the algorithm identifies node 19 as an outlier. If we remove node 19, and then run it again, we can identify the other outlier.

In [None]:
# Load the data and create a graph G
nodes = pd.read_csv('data/cluster3.csv', index_col=0)
edges = pd.read_csv('data/cluster3_edges.csv', index_col=0)
G = vl.create_network(nodes, edges)

# Step through Kruskal's algorithm on G
show(vl.mst_algorithm_plot(G, algorithm='kruskals', width=900, height=500))

Thus, we have identified all the outliers, and after removing them, we have already seen how the clusters can be found. 
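
The one-at-a-time outlier removal used above can be sketched in plain Python. This is an illustrative sketch only: the function names and toy points are made up, Euclidean distance is an assumed edge weight, and the stopping rule (stop as soon as the smaller of the two components has more than one point) is an assumption rather than something the lab software does. It uses `math.dist`, which requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def split_in_two(points, ids):
    """Run Kruskal's merging on the points in `ids`, stopping at two components."""
    label = {i: i for i in ids}
    pairs = sorted(combinations(ids, 2),
                   key=lambda p: dist(points[p[0]], points[p[1]]))
    components = len(ids)
    for i, j in pairs:
        if components == 2:
            break
        if label[i] != label[j]:
            # Merge: relabel every point of i's component with j's label.
            old, new = label[i], label[j]
            for m in ids:
                if label[m] == old:
                    label[m] = new
            components -= 1
    groups = {}
    for i in ids:
        groups.setdefault(label[i], []).append(i)
    return sorted(groups.values(), key=len)  # smaller component first


def peel_outliers(points):
    """Repeatedly remove a single point that Kruskal's leaves on its own."""
    ids = list(range(len(points)))
    outliers = []
    while len(ids) > 2:
        smaller, larger = split_in_two(points, ids)
        if len(smaller) != 1:
            break  # the split separates two real clusters, not an outlier
        outliers.append(smaller[0])
        ids = larger
    return outliers, ids


# Two tight groups plus two stray points (made-up data).
points = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1), (5, 40), (30, 15)]
print(peel_outliers(points))  # ([6, 7], [0, 1, 2, 3, 4, 5])
```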

We can apply this method to large data sets, like the taxi data from New York City in 2014. We have data for how many taxis were hailed in every 15-minute period of every day of 2014. For example, this is the start of the data from the first week of 2014.

In [None]:
with open('data/taxi_count_dict.pickle', 'rb') as handle:
    taxi_counts = pd.DataFrame(pickle.load(handle))
print(taxi_counts.loc[0:6])

The columns m and d together represent the date, and the weekday column gives the day of the week, so 0 is a Monday, 1 is a Tuesday, and so on. The count_vector is a list of 96 numbers giving how many taxis were hailed in each fifteen-minute period of the day, starting with the period from midnight to 12:15 AM.
This data is visualized in the image below, and it is easy to see that there are patterns for each day.
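
To read an individual entry of count_vector, it helps to convert its position in the list into a clock time. The sketch below assumes zero-based positions (position 0 is the period starting at midnight); the helper name `period_label` is made up for illustration. The weekday numbering described above matches the one used by Python's `datetime.date.weekday()`.

```python
from datetime import date


def period_label(index):
    """Clock time at which 15-minute period `index` (0 to 95) begins."""
    if not 0 <= index < 96:
        raise ValueError("index must be between 0 and 95")
    minutes = index * 15
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


print(period_label(0))   # 00:00
print(period_label(34))  # 08:30
print(period_label(95))  # 23:45

# 0 is Monday, so January 1, 2014 (a Wednesday) has weekday 2.
print(date(2014, 1, 1).weekday())  # 2
```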

In [None]:
Image("images/taxi_ride_frequency.png", width=1000)

We can try to form clusters separating the days so that they can be identified by whether or not they fall on a weekend. However, even a quick look at the data shows that there are many outliers that do not fall into any set pattern.

**Q:** Looking at the above graphic, what can you tell about the days that are outliers?

**A:** <font color='blue'> Answers should be along these lines: holidays, and the days around them, tend to be outliers. Also, days with lower-than-typical ridership are outliers. </font>

We need to remove these outliers. By using the method above, the outliers were identified one by one, and they were found to be special days, like holidays or the days around holidays, where the pattern is not like any other day. These days were often characterized by very few riders, and so days with fewer than 70,000 total riders were removed, as were the 1st and 2nd of January, the 31st of December, Martin Luther King Jr. Day, Presidents' Day, and the start and end of daylight saving time. This left us with a set of 344 days. We can then form clusters on this data, classifying a node based on whether it is a weekday or a weekend.
After removing the outliers, the clusters we get are as below. The weekdays are the red nodes, Saturdays are green, and Sundays are blue. Note that even though the Saturdays and Sundays look separated, they are part of the same cluster (connected by an edge in the tree). This is because, while we could try to separate the data into three clusters, the Saturdays and Sundays are similar enough that Kruskal's selects the edge connecting them before it would connect either group to the weekdays.

In [None]:
Image("images/cg.graphml.png", width=1000)