# Practice Session 08: Connected components and k-core decomposition

In this session we will use [NetworkX](https://networkx.github.io/) to compute the number of connected components and the size of the largest connected component of a graph, and to perform a k-core decomposition. We will use the [Star Wars graph](https://github.com/evelinag/StarWars-social-network/tree/master/networks).

The dataset is contained in this input file that you will find in our [data](https://github.com/chatox/networks-science-course/tree/master/practicum/data) directory:
* ``starwars.graphml``: co-occurrence of characters in scenes in the Star Wars saga in [GraphML](http://graphml.graphdrawing.org/) format.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. The Star Wars graph

The following code just loads the *Star Wars* graph into variable *g*. Leave as-is.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
import io
import networkx as nx
import matplotlib.pyplot as plt
import random
import numpy as np
import statistics

In [None]:
INPUT_GRAPH_FILENAME = "starwars.graphml"

In [None]:
# LEAVE AS-IS

# Read the graph in GraphML format
g_in = nx.read_graphml(INPUT_GRAPH_FILENAME)

# Re-label the nodes so they use the 'name' as label
g_relabeled = nx.relabel.relabel_nodes(g_in, dict(g_in.nodes(data='name')))

# Convert the graph to undirected
g = g_relabeled.to_undirected()

In [None]:
# LEAVE AS-IS (OR MODIFY IF YOU WANT)

def plot_graph(g):

    # Create a plot of 20x14
    plt.figure(figsize=(20,14))

    # Layout the nodes using a spring model
    nx.draw_spring(g, with_labels=True, node_size=1, bbox=dict(facecolor="yellow", edgecolor='black', boxstyle='round,pad=0.1'))

    # Display
    plt.show()
    
plot_graph(g)

<font size="+1" color="red">Replace this cell with your answer to the following: is this a connected graph? Why or why not?</font>

Next, compute the maximum, average, standard deviation, and mode of the degree of the nodes.

You can use functions in the [numpy](https://numpy.org/) and [statistics](https://docs.python.org/3/library/statistics.html) modules for this.
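
As a hint, here is a minimal sketch on a toy graph (the star graph and the variable names are assumptions for illustration, not part of the assignment). It uses only the `statistics` module; `np.mean` and `np.std` would give the same mean and (population) standard deviation.

```python
import statistics
import networkx as nx

# Toy graph (an assumption for illustration): a star with one hub and four leaves
toy = nx.star_graph(4)

# toy.degree() yields (node, degree) pairs; keep only the degrees
degrees = [degree for _, degree in toy.degree()]

max_degree = max(degrees)
mean_degree = statistics.mean(degrees)
# Population standard deviation, which is also what np.std computes by default
std_degree = statistics.pstdev(degrees)
mode_degree = statistics.mode(degrees)

print(max_degree, mean_degree, std_degree, mode_degree)
```

On the Star Wars graph you would build the degree list from *g* instead of *toy*.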

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute the maximum, average, standard deviation, and mode of the degree of the nodes.</font>

<font size="+1" color="red">Replace this cell with your answer to the following: is this a scale-free network? Why or why not?</font>

# 2. Remove a fraction of edges

The following function, which you should leave as-is, returns a new graph which is a copy of *g* in which a fraction *p* of edges have been removed.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

def remove_edges_uniformly_at_random(g_in, p):
    # Check input is within bounds
    if p < 0.0 or p > 1.0:
        raise ValueError
    
    # Create a copy of the input graph
    g_out = g_in.copy()
    
    # Decide how many edges should be in the output graph
    target_num_edges = int((1.0-p) * g_in.number_of_edges())

    # While there are more edges than desired
    while g_out.number_of_edges() > target_num_edges:
        
        # Remove one random edge
        edge = random.choice(list(g_out.edges()))
        
        if g_out.has_edge(edge[0], edge[1]):
            g_out.remove_edge(edge[0], edge[1])
    
    # Return the resulting graph
    return g_out

Use `remove_edges_uniformly_at_random(g, p)` to create three graphs named *g10*, *g50*, and *g90* that should contain 10%, 50%, and 90% of the edges in the original graph, respectively. Note that *p* is the fraction of edges removed, not the fraction kept. Then, plot those three graphs.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create g10, g50, g90 as described above, and to plot these three graphs. Use three different cells for the plots.</font>

<font size="+1" color="red">Replace this cell with a brief commentary of what you observe visually in these three graphs with respect to the number of connected components, the number of singletons, and the size of the largest connected components.</font>

The following function, `remove_edges_by_betweenness(g, p)`, which you should leave as-is, removes a fraction *p* of the edges with the highest betweenness.
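
To build some intuition for edge betweenness, the following sketch (on a toy graph chosen here for illustration, not taken from the dataset) ranks the edges of two triangles joined by a single bridge. The bridge carries every shortest path between the two triangles, so it should rank first.

```python
import networkx as nx

# Toy graph (an assumption for illustration): two triangles joined by one bridge edge
toy = nx.barbell_graph(3, 0)

# Dictionary mapping each edge (u, v) to its betweenness
edge_betweenness = nx.edge_betweenness_centrality(toy)

# Sort edges from highest to lowest betweenness
ranked = sorted(edge_betweenness.items(), key=lambda item: item[1], reverse=True)
top_edge, top_score = ranked[0]

print("Edge with the highest betweenness:", top_edge)

# Removing that single edge splits the toy graph in two
toy.remove_edge(*top_edge)
print("Components after removal:", nx.number_connected_components(toy))
```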

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

def remove_edges_by_betweenness(g_in, p):
    # Check input is within bounds
    if p < 0.0 or p > 1.0:
        raise ValueError
        
    # Create a copy of the input graph
    g_out = g_in.copy()
    
    # Compute edge betweenness (edge_betweenness_centrality is the documented NetworkX name)
    edge_betweenness = nx.edge_betweenness_centrality(g_out)
    edges_by_betweenness = sorted(edge_betweenness.items(), key=lambda x:x[1], reverse=True)
    
    # Decide how many edges should be in the output graph
    target_num_edges = int((1.0-p) * g_in.number_of_edges())

    # While there are more edges than desired
    while g_out.number_of_edges() > target_num_edges:
        
        to_remove = edges_by_betweenness.pop(0)
        edge_to_remove = to_remove[0]
                
        g_out.remove_edge(edge_to_remove[0], edge_to_remove[1])
    
    # Return the resulting graph
    return g_out

Next, we use this function to remove the top 50% of the edges by betweenness.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

g50b = remove_edges_by_betweenness(g, 0.50)
plot_graph(g50b)

<font size="+1" color="red">Replace this cell with a brief commentary of what you observe visually in this graph where the top 50% of edges by betweenness were removed, in comparison with the graph in which 50% of the edges were removed uniformly at random. Explain why you think this happens.</font>

# 3. Number of connected components

Next, we will write some code to count the number of connected components.

This code will be structured around two functions: `assign_component` and `assign_component_recursive`.

The function `assign_component(g)` takes as input a graph *g*, and returns a dictionary that maps every node in *g* to a positive integer indicating its connected component number. The `assign_component` function should do the following:

1. Create an empty dictionary `node2componentid`
1. Start with `componentid = 1`
1. Iterate through all the nodes in the graph: `for node in g.nodes()`
1. For each `node` that is not in the `node2componentid` dictionary (i.e., `if node not in node2componentid`), call `assign_component_recursive`, incrementing *componentid* by 1 in each call
1. Return the `node2componentid` dictionary.

The function `assign_component_recursive(g, node2componentid, starting_node, componentid)` takes the following arguments:

1. A graph *g*
1. A dictionary *node2componentid*
1. A starting node *starting_node*
1. A number *componentid*

The function should do the following:

1. Set `node2componentid[starting_node] = componentid`.
1. For each neighbor in `g.neighbors(starting_node)`, if that neighbor is not in the `node2componentid` dictionary, call the function `assign_component_recursive(g, node2componentid, neighbor, componentid)`. 
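
To check your two functions on a small case, one option (a sketch, not part of the assignment) is to build a reference mapping with NetworkX's built-in `nx.connected_components` and compare it with your own output. The toy graph and the name `reference` are assumptions for illustration. The component numbers themselves may differ from yours; what should match is which nodes share a number and how many distinct numbers there are.

```python
import networkx as nx

# Toy graph (an assumption for illustration) with three components:
# {a, b, c}, {d, e}, and the isolated node f
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("d", "e")])
toy.add_node("f")

# Build a reference node -> component id mapping from NetworkX's built-in
reference = {}
for componentid, component in enumerate(nx.connected_components(toy), start=1):
    for node in component:
        reference[node] = componentid

print(reference)
print("Number of components:", len(set(reference.values())))
```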

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for assign_component_recursive.</font>

<font size="+1" color="red">Replace this cell with your code for assign_component.</font>

The code below, which you should leave as-is, returns the number of connected components. 

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

def count_connected_components(g):
    # Call the function to assign each node to a connected component
    node2componentid = assign_component(g)
    
    # Count the number of distinct values in this assignment
    return len(set(node2componentid.values()))

The following code, which you should leave as-is, computes how many connected components are in graphs in which 0%, 2%, 4%, ..., 100% of the edges are removed. 

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

def generate_graphs_by_removing_edges(graph, method):
    ncomponents_after_reducing = {}
    for p in np.arange(0.0, 1.02, 0.02):
        print("- {:.0f}% of the edges".format(p*100))
        reduced_graph = method(graph, p)
        ncomponents_after_reducing[p] = count_connected_components(reduced_graph)
    return ncomponents_after_reducing

print("Generating graphs by removing edges uniformly at random")
components_removing_uniformly_at_random = generate_graphs_by_removing_edges(g, remove_edges_uniformly_at_random)

print("Generating graphs by removing edges by betweenness")
components_removing_by_betweenness = generate_graphs_by_removing_edges(g, remove_edges_by_betweenness)

Create a plot in which the x axis is the fraction of removed edges and the y axis is the number of connected components. **Include both lines (uniformly at random and betweenness) in the same graph, include a legend, and remember to label both axes.** 

A basic line plot from a dictionary *d* is obtained as follows.

```python
x_vals = sorted(d.keys())
y_vals = [d[x] for x in x_vals]
plt.plot(x_vals, y_vals, ...)
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create the described graph.</font>

<font size="+1" color="red">Replace this cell with a brief commentary with what you observe on this graph. Do you see a linear trend, or something else?</font>

# 4. Largest connected component

Write a function `size_largest_connected_component` to compute the size of the largest connected component on a graph. Basically you need to call `assign_component` and then iterate through the nodes, counting how many times you see each *componentid*, and returning the maximum of this.

To obtain the maximum value on a dictionary, e.g., *component_sizes*, you can use `np.max(list(component_sizes.values()))`
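
As a hint for the counting step, this sketch uses `collections.Counter` on a hand-written mapping. The dictionary below is hypothetical, standing in for what `assign_component` might return on a six-node graph.

```python
from collections import Counter

# Hypothetical output of assign_component on a six-node graph
node2componentid = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}

# Count how many nodes carry each component id
component_sizes = Counter(node2componentid.values())

# Size of the largest component
largest = max(component_sizes.values())
print(component_sizes, largest)
```

A plain dictionary with a loop works equally well; `Counter` just shortens the bookkeeping.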

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code for "size_largest_connected_component".</font>

Next, use the `size_largest_connected_component` function to obtain data to create a plot. 

The data you should obtain will be in two dictionaries:

* `largest_wcc_removing_uniformly_at_random[p]` should contain the size of the largest connected component obtained when removing a fraction p of edges uniformly at random
* `largest_wcc_removing_by_betweenness[p]` should contain the size of the largest connected component obtained when removing a fraction p of edges by betweenness

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to generate the requested data (the two dictionaries described above).</font>

Next, create a plot. In this plot, the x axis should show the fraction of removed edges and the y axis the size of the largest connected component as a fraction of the total number of nodes. **Both axes should go from 0.0 to 1.0**; remember to include both series, to include a legend, and to label the axes.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to generate the requested graph.</font>

<font size="+1" color="red">Replace this cell with a brief commentary indicating what you see in this plot, and answering the following questions: (1) approximately what percentage of edges do you need to remove at random to make the largest connected component shrink to 90% of the nodes in the graph? (2) approximately what percentage of top edges by betweenness do you need to remove to make the largest connected component shrink to 90% of the nodes in the graph?</font>

# 5. K-core decomposition

Now we will perform a k-core decomposition, using the following auxiliary functions, which you can leave as-is.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

def get_max_degree(g):
    degree_sequence = [x[1] for x in g.degree()]
    return(max(degree_sequence))


def nodes_with_degree_less_or_equal_than(g, degree):
    nodes = []
    for node in g.nodes():
        if g.degree(node) <= degree:
            nodes.append(node)
    return nodes

Complete the code for the function `kcore_decomposition(graph)`; to use it, call `node_to_kcore = kcore_decomposition(g)`.

```python
def kcore_decomposition(graph):
    g = graph.copy()
    max_degree = get_max_degree(g)

    node_to_level = {}
    for level in range(1, max_degree + 1):

        while True:
            # Obtain the list of nodes with degree <= level
            nodes_in_level = nodes_with_degree_less_or_equal_than(g, level)

            # Check if this list is empty
            if len(nodes_in_level) == 0:
                # TO-DO: implement (one line)

            # If the list is not empty, assign the nodes to the
            # corresponding level and remove the node
            for node in nodes_in_level:
                # TO-DO: implement this (two lines)

    return(node_to_level)
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for "kcore_decomposition". Please remember to include enough comments to explain what your code is doing.</font>

Test your code. The following should print:

```python
K-core of JANSON: 1
K-core of RED TEN: 2
K-core of LUKE: 7
K-core of YODA: 8
```
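
Beyond these four characters, one way to cross-check your results (a sketch, not part of the assignment) is NetworkX's built-in `nx.core_number`, shown here on a toy graph chosen for illustration. For nodes with at least one edge, its values should agree with the level-based procedure above. Isolated nodes are an exception: `nx.core_number` gives them 0, while the procedure above starts at level 1 and would assign them 1.

```python
import networkx as nx

# Toy graph (an assumption for illustration): a triangle 1-2-3 plus a pendant node 4
toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (1, 3), (1, 4)])

# Built-in core numbers: largest k such that the node belongs to the k-core
reference = nx.core_number(toy)
print(reference)
```

The triangle nodes each keep two neighbors inside the triangle, while the pendant node has only one neighbor, so it falls out at the first level.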

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

node_to_kcore = kcore_decomposition(g)

for character in ["JANSON", "RED TEN", "LUKE", "YODA"]:
    print("K-core of {:s}: {:d}".format(character, node_to_kcore[character]))

The following code, which you should leave as-is, displays the graph annotated with the k-core number of each node.

In [None]:
# LEAVE AS-IS

# Compute k-core decomposition
node_to_kcore = kcore_decomposition(g)

# Rename nodes so they include the k-core
node_to_kcore_texts = dict([(name, str(node_to_kcore[name]) + ":" + name) for name in g.nodes()])
h = nx.relabel_nodes(g, node_to_kcore_texts)

# Draw the graph
plt.figure(figsize=(20,20))
nx.draw_spring(h, with_labels=True, node_size=1, bbox=dict(facecolor="yellow", edgecolor='black', boxstyle='round,pad=0.1'))
plt.show()

<font size="+1" color="red">Replace this cell with a brief commentary on the graph you see, including which kinds of characters you find at different k-core levels.</font>

# Deliver, your code you must (individually)

A .zip file containing:

* This notebook.


## Available, extra points are

For extra points and extra learning (+2, so your maximum grade can be a 12 in this assignment), note that in our graphs for the number of connected components and size of largest connected component, the line is not smooth. This is because there are random variations as the removal of edges is random every time. To fix this, you will need to repeat each experiment, e.g., 100 times, and plot the average line. For extra points, replace the graphs of number of connected components and size of largest connected component with an average of multiple experimental runs.
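
The averaging idea can be sketched as follows on a toy graph. The helper names `components_after_random_removal` and `average_over_runs`, the toy graph, and the use of `nx.number_connected_components` are all assumptions made for illustration; in your notebook you would call your own functions on *g* instead.

```python
import random
import statistics
import networkx as nx

def components_after_random_removal(graph, p, rng):
    """Remove a fraction p of the edges at random; return the number of components."""
    g = graph.copy()
    edges = list(g.edges())
    target_num_edges = int((1.0 - p) * len(edges))
    g.remove_edges_from(rng.sample(edges, len(edges) - target_num_edges))
    return nx.number_connected_components(g)

def average_over_runs(graph, fractions, num_runs=100, seed=0):
    """Average the number of components over num_runs random removals per fraction."""
    rng = random.Random(seed)
    averages = {}
    for p in fractions:
        runs = [components_after_random_removal(graph, p, rng) for _ in range(num_runs)]
        averages[p] = statistics.mean(runs)
    return averages

# Toy graph (an assumption for illustration): a complete graph on 5 nodes
toy = nx.complete_graph(5)
averages = average_over_runs(toy, [0.0, 0.5, 1.0], num_runs=20)
print(averages)
```

The resulting dictionary has the same shape as the single-run dictionaries, so the same plotting code can be reused.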

**Note:** if for extra points you go, ``<font size="+2" color="blue">Additional results: multiple experiments per graph</font>`` at the top of your notebook, you must add.

<font size="-1" color="gray">(This cell, when delivering, remove.)</font>

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>