# Module 2 Assignment

In [1]:
import re
import time
import pickle


from networkx.drawing.nx_pydot import graphviz_layout
from networkx.algorithms import community
from networkx.algorithms.community import modularity
import networkx as nx

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
#import seaborn as sns
%matplotlib inline

## <font color='red'>Important</font>

<font color='red' size=2>Please **AVOID** using `community` and `modularity` as your variable names. These names are reserved for the networkx imports above. Reassigning them will cause autograder failures.</font>

In [2]:
# disable warnings
import warnings
warnings.filterwarnings('ignore')

## Part 1: Wikipedia Network with Communities

## Data description

In this assignment, we are going to analyze the community structure of a network. We will use a Wikipedia-based [Map of Science](https://figshare.com/articles/A_Wikipedia_Based_Map_of_Science/11638932) network for our exploration. In this network, each node represents a Wikipedia page in a domain of science, such as natural science or social science. An edge exists between two nodes if the cosine similarity of their page contents reaches a predefined threshold.
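
To make the edge rule concrete, here is a minimal sketch of thresholding cosine similarity. The page names, the term-count vectors, and the `THRESHOLD` value are all hypothetical; the cutoff actually used to build the dataset is not stated in this notebook.

```python
import math
from itertools import combinations

# Hypothetical term-count vectors standing in for page contents.
pages = {
    "Algebra": [3, 1, 0, 0],
    "Geometry": [2, 2, 0, 0],
    "Botany": [0, 0, 4, 1],
}
THRESHOLD = 0.5  # assumed cutoff, for illustration only


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Keep an edge only when the similarity reaches the threshold.
edges = []
for a, b in combinations(pages, 2):
    sim = cosine_similarity(pages[a], pages[b])
    if sim >= THRESHOLD:
        edges.append((a, b, sim))

print(edges)
```

In this toy data only the Algebra/Geometry pair (similarity of about 0.894) clears the cutoff, so the resulting graph would have a single edge.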

In [3]:
G = nx.read_gml('assets/MapOfScience.gml', label='id')

Each node in the graph contains the following attributes:
- `name`: the title of the article
- `Class`: the science domain
- `WikipediaUrl`: the Wikipedia URL

Let's look at some examples: 

In [4]:
list(G.nodes(data=True))[0:5]

[(0,
  {'label': '0',
   'name': 'Accounting',
   'Class': 'Applied',
   'WikipediaUrl': 'https://en.wikipedia.org/wiki/Accounting'}),
 (1,
  {'label': '1',
   'name': 'Aerospace engineering',
   'Class': 'Applied',
   'WikipediaUrl': 'https://en.wikipedia.org/wiki/Aerospace_engineering'}),
 (2,
  {'label': '2',
   'name': 'Agricultural engineering',
   'Class': 'Applied',
   'WikipediaUrl': 'https://en.wikipedia.org/wiki/Agricultural_engineering'}),
 (3,
  {'label': '3',
   'name': 'Agricultural science',
   'Class': 'Applied',
   'WikipediaUrl': 'https://en.wikipedia.org/wiki/Agricultural_science'}),
 (4,
  {'label': '4',
   'name': 'Agronomy',
   'Class': 'Applied',
   'WikipediaUrl': 'https://en.wikipedia.org/wiki/Agronomy'})]

The edges contain the cosine similarity of the text of the two articles:

In [5]:
list(G.edges(data=True))[0:5]

[(0, 21, {'CosineSimilarity': 0.369447477753246}),
 (0, 50, {'CosineSimilarity': 0.395432741205435}),
 (0, 70, {'CosineSimilarity': 0.388758063740006}),
 (0, 88, {'CosineSimilarity': 0.371542879166867}),
 (0, 516, {'CosineSimilarity': 0.365688238862527})]

Let's extract the largest connected component from the above graph. In the following questions, we will focus our analysis on this sub-graph. 
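
As a small illustration of what "largest connected component" means, the sketch below applies the same idea to a toy graph. The graph `toy` is hypothetical and is not the assignment data.

```python
import networkx as nx

# Toy graph (hypothetical): a 4-node path plus a separate 2-node component.
toy = nx.Graph()
toy.add_edges_from([(0, 1), (1, 2), (2, 3)])
toy.add_edge(10, 11)

# connected_components yields one set of nodes per component.
largest_nodes = max(nx.connected_components(toy), key=len)

# subgraph() returns a read-only view; .copy() gives an independent, editable graph.
toy_lcc = toy.subgraph(largest_nodes).copy()

print(sorted(toy_lcc.nodes()))    # [0, 1, 2, 3]
print(toy_lcc.number_of_edges())  # 3
```

Whether to call `.copy()` is a design choice: the view is enough for read-only analysis, while a copy is needed if the sub-graph will be modified later.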

In [6]:

# This line extracts the largest connected component of the original dataset

G = G.subgraph(max(nx.connected_components(G), key=len))


### Task 1a. (1 Point, Autograded)

How many nodes are there in this new network? Assign the answer to the variable `N`.

Hint: the `.nodes()` method returns a view of all the nodes, which behaves much like a list. How might you find the number of items in a list?

In [7]:
task_id = '1a'


In [8]:
def task_1a_solution():
    N = G.number_of_nodes()  # counts number of nodes in the largest connected component
    return N


In [9]:
# Use this cell to explore your solution.

task_1a_solution()

677

In [10]:
print(f"Task {task_id} - AG tests")
stu_ans = task_1a_solution()

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, int
), f"Task {task_id}: Your function should return an integer. "


#hidden tests for Question 1 are within this cell

Task 1a - AG tests
Task 1a - your answer:
677


### Task 1b: (2 points, Autograded)

If we think of science 'classes' as communities, how many communities are there in the network, and what are the names of all the classes?

Assign the number of classes in `G` to `N`, and a set containing the names of the classes to `set_of_communities`.

Hint: remember each node has attributes such as ['name'] and ['Class'], the latter of which is relevant for this assignment.



In [11]:
task_id = '1b'


In [12]:
def task_1b_solution():
    set_of_communities = set(nx.get_node_attributes(G, "Class").values())
    N = len(set_of_communities)
    return N, set_of_communities


In [13]:
# Use this cell to explore your solution.

task_1b_solution()

(4, {'Applied', 'Formal', 'Natural', 'Social'})

In [14]:
print(f"Task {task_id} - AG tests")
stu_N, stu_set_of_communities = task_1b_solution()

print(f"Task {task_id} - your answer:\n{stu_N, stu_set_of_communities}")

assert isinstance(
    stu_N, int
), f"Task {task_id}: N should be an integer. "

assert isinstance(
    stu_set_of_communities, set
), f"Task {task_id}: set_of_communities should be a set. "


#hidden tests for Question 2 are within this cell

Task 1b - AG tests
Task 1b - your answer:
(4, {'Formal', 'Applied', 'Natural', 'Social'})


## Part 2: Measures of Partition Quality

We discussed 5 ways to measure the quality of a partition: modularity, coverage, performance, separability, and density. 

Modularity, coverage, and performance can be measured using `networkx` functions in the `algorithms.community` module. You can check all the functions provided by a module with the built-in `dir()` function:

```python
from networkx.algorithms import community
dir(community)
>>> ...
 'modularity',
 'partition_quality',
```

Descriptions of the three measures are provided in the source code <br>
(note that the second function returns both coverage and performance):

1. `networkx.algorithms.community.modularity(G, communities, weight='weight')`

2.  `networkx.algorithms.community.partition_quality(G, partition)`
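
A minimal sketch of calling both functions follows. The graph `toy` and the partition `toy_partition` are hypothetical, chosen so the values can be checked by hand.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
toy_partition = [{0, 1, 2}, {3, 4, 5}]

# modularity returns a single float.
mod = community.modularity(toy, toy_partition)

# partition_quality returns a (coverage, performance) pair.
cov, perf = community.partition_quality(toy, toy_partition)

print(mod, cov, perf)
```

For this graph, 6 of the 7 edges are intra-community, so coverage is 6/7. Performance counts the 6 intra-community edges plus the 8 inter-community node pairs that are not connected, out of 15 pairs, giving 14/15. Modularity works out to 5/14.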



For separability and density, you will implement your own functions. 

Since separability and density first apply a measure to each community and then take the average over all communities, we provide a helper function `avg_measure(G, communities, measure)` that computes the average value for a given measure over all communities in a graph:

In [15]:
def avg_measure(G, communities, measure):
    """
    Calculate the average value of a given measure across communities in a graph.
    
    Args:
        G - nx graph object; a graph to operate upon.
        communities - list; a collection of sets of nodes that each comprise a community.
        measure - function; calculates a measure for a community in a graph.
    
    Returns:
        (unnamed) - float; the calculated average measure.
    """
    sum_ = 0
    for comm in communities:
        sum_ += measure(G, comm)
    return sum_ / len(communities)

### Task 2a: (10 points, Autograded) 

Let's begin implementing the function to measure the separability for a single community. That is, measure the ratio of intra-community to inter-community edges. 

If there are 0 inter-community edges, the function should assume that the actual number is 1 and return the number of intra-community edges. 

Hints: 

- `G.edges(community)`, where `community` is a set of nodes, returns the edges that are incident to at least one node in `community`. That is, it returns the edges that have at least one endpoint in `community`. 

- Remember that all edges are either inter- or intra- community edges.
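
The two hints can be checked on a small example. The sketch below uses a hypothetical toy graph (not the assignment data) in which the counts are easy to verify by eye.

```python
import networkx as nx

# Toy graph (hypothetical): a triangle {0, 1, 2} with one edge leaving it.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
comm = {0, 1, 2}

# Edges with at least one endpoint in comm; each edge is reported once.
incident = list(toy.edges(comm))

# An edge is intra-community when both endpoints are in comm.
intra = sum(1 for u, v in incident if u in comm and v in comm)

# Every incident edge is either intra- or inter-community.
inter = len(incident) - intra

# Treat 0 inter-community edges as 1 to avoid dividing by zero.
separability = intra / max(inter, 1)
print(intra, inter, separability)  # 3 1 3.0
```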

In [16]:
task_id = '2a'



In [17]:
def task_2a_solution(G, community):
    """
    Calculate the separability of a community by finding the ratio of
    intra-community edges to inter-community edges.
    
    Args:
        G - nx graph object; a graph to operate upon.
        community - set; a collection of nodes that comprise a community.
    
    Returns:
        result - float; the separability in a given community.
    """
    edges = G.edges(community)
    num_intra_edges = 0

    for u, v in edges:
        if u in community and v in community:
            num_intra_edges += 1

    total_edges = len(list(edges))
    num_inter_edges = total_edges - num_intra_edges

    if num_inter_edges == 0:
        result = num_intra_edges / 1
    else:
        result = num_intra_edges / num_inter_edges

    return result
    



In [18]:
# Use this cell to explore your solution.

task_2a_solution(G, [0,1])

0.0

In [19]:
print(f"Task {task_id} - AG tests")
stu_ans = task_2a_solution(G, [0,1])

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, float
), f"Task {task_id}: Your function should return a float. "


#hidden tests for Question 4 are within this cell

Task 2a - AG tests
Task 2a - your answer:
0.0


Now we can simply use `avg_measure(G, communities, task_2a_solution)` to measure the separability of a partition. 

### Task 2b: (10 points, Autograded) 

Let's now implement the function to measure the density of a single community. That is, the fraction of intra-community edges out of all possible edges. 

Hint: The equation to calculate density is 


2 * (no. of intra-community edges) / (no. of community nodes * (no. of community nodes - 1))
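
A short worked example may help here. It reads "all possible edges" as the number of unordered node pairs in the community, n(n-1)/2 for a community of n nodes. The node names and edge list are hypothetical.

```python
# Toy community (hypothetical): 4 nodes with 3 of the 6 possible internal edges.
comm = {"a", "b", "c", "d"}

# ("d", "x") leaves the community, so it does not count as intra-community.
edge_list = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "x")]

n = len(comm)
intra = sum(1 for u, v in edge_list if u in comm and v in comm)

possible = n * (n - 1) / 2   # 4 * 3 / 2 = 6 unordered node pairs
density = intra / possible   # same as 2 * intra / (n * (n - 1))
print(density)               # 0.5
```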

In [20]:
task_id = '2b'


In [21]:
def task_2b_solution(G, community):
    """
    Calculate the density of a community by finding the fraction of
    intra-community edges out of all possible edges.
    
    Args:
        G - nx graph object; a graph to operate upon.
        community - set; a collection of nodes that comprise a community.
    
    Returns:
        result - float; the density of a given community.
    """
    
    edges = G.edges(community)
    num_intra_edges = 0
    
    if len(community) == 1:  # A single-node community has no possible edges; return 1.0 by convention
        return 1.0

    for u, v in edges:
        if u in community and v in community:
            num_intra_edges += 1

    n = len(community)
    possible_edges = n * (n - 1) / 2
    result = num_intra_edges / possible_edges

    return result




In [22]:
# Use this cell to explore your solution.

task_2b_solution(G, [0,1])

0.0

In [23]:
print(f"Task {task_id} - AG tests")
stu_ans = task_2b_solution(G, [0,1])

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert isinstance(
    stu_ans, float
), f"Task {task_id}: Your function should return a float. "

#hidden tests for Question 5 are within this cell

Task 2b - AG tests
Task 2b - your answer:
0.0


Now we can simply use `avg_measure(G, communities, task_2b_solution)` to measure the density of a partition. 

## Part 3: Community Detection Algorithms

Now let's apply community detection algorithms to the Wikipedia graph. For this part, we will ignore the science domains since we are trying to find communities purely based on the structure of the network. 

We will begin with the Girvan-Newman algorithm and pick the partition that results in the largest modularity, as described in the lecture. Since computing this partition takes a long time, we have precomputed the communities in the notebook `Girvan-Newman.ipynb` (stored in the resources folder) and exported the partition to a pickle file. The pickle file is stored as `assets/answer/max_mod_community`. Note that you do not need to run the notebook `Girvan-Newman.ipynb` since we have already done it for you. We only provide it to you as a reference.
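
For reference, the selection step can be sketched on a graph small enough to run instantly. This is an illustration on a hypothetical toy graph and is not necessarily how `Girvan-Newman.ipynb` is written.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# girvan_newman yields successively finer partitions (tuples of node sets).
# Score each one and keep the partition with the highest modularity.
best_partition = max(
    community.girvan_newman(toy),
    key=lambda partition: community.modularity(toy, partition),
)

print(best_partition)
```

On this graph the best partition is the first one produced, which separates the two triangles. Every later, finer partition has lower modularity.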

Here, we will load the partition from `assets/answer/max_mod_community`



In [24]:

max_mod_community = None

with open("assets/answer/max_mod_community", 'rb') as f:
    max_mod_community = pickle.load(f)



In [25]:
print(max_mod_community)


({0, 1, 517, 519, 526, 19, 21, 533, 535, 24, 538, 27, 540, 29, 542, 31, 32, 543, 35, 37, 549, 554, 43, 42, 560, 561, 50, 562, 48, 565, 51, 55, 564, 569, 570, 59, 572, 575, 576, 65, 578, 69, 70, 583, 585, 587, 75, 77, 590, 591, 76, 80, 593, 82, 596, 83, 595, 599, 88, 89, 603, 604, 605, 96, 613, 614, 615, 616, 617, 618, 619, 620, 622, 111, 114, 627, 116, 626, 632, 634, 635, 637, 638, 127, 129, 641, 642, 133, 645, 647, 649, 650, 137, 140, 141, 654, 655, 656, 142, 143, 659, 148, 652, 145, 666, 155, 670, 158, 159, 674, 167, 177, 183, 187, 139, 188, 190, 191, 192, 193, 195, 202, 205, 217, 219, 226, 228, 234, 237, 247, 255, 257, 267, 268, 271, 273, 274, 275, 284, 302, 313, 365, 382, 426, 427, 453, 454, 456, 461, 462, 464, 467, 468, 470, 478, 480, 481, 482, 484, 489, 490, 492, 493, 495, 497, 498, 499, 500, 505, 506}, {2, 7, 9, 10, 521, 12, 16, 18, 26, 547, 36, 553, 556, 557, 580, 588, 84, 606, 609, 611, 105, 109, 625, 628, 122, 128, 130, 646, 653, 660, 668, 157, 672, 163, 204, 212, 233, 252, 2

### Task 3a: (5 points, Autograded) 

What is the modularity, coverage, performance, density, and separability of the partition?

In [26]:
task_id = '3a'


In [27]:
from networkx.algorithms import community

def task_3a_solution(Graph):
    mod = community.modularity(Graph, max_mod_community)
    
    cov, perf = community.partition_quality(Graph, max_mod_community)

    sep = avg_measure(Graph, max_mod_community, task_2a_solution)
    den = avg_measure(Graph, max_mod_community, task_2b_solution)

    return mod, cov, perf, sep, den


In [28]:
# Use this cell to explore your solution.

task_3a_solution(G)

(0.5806872261399931,
 0.8152524167561761,
 0.8887058288830815,
 1.079485235171277,
 0.6542182538818658)

In [29]:
print(f"Task {task_id} - AG tests")
stu_mod, stu_cov, stu_perf, stu_sep, stu_den = task_3a_solution(G)

print(f"Task {task_id} - your answer:\n{stu_mod, stu_cov, stu_perf, stu_sep, stu_den}")

assert all(
    isinstance(num, float) for num in [stu_mod, stu_cov, stu_perf, stu_sep, stu_den]
), (
    f"Task {task_id}: Your returned tuple should be a tuple of floats. "
)

#hidden tests for Question 8 are within this cell

Task 3a - AG tests
Task 3a - your answer:
(0.5806872261399931, 0.8152524167561761, 0.8887058288830815, 1.079485235171277, 0.6542182538818658)


### Task 3b: (12 points, Autograded) 

Find a partition of the network with the [label propagation algorithm](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.label_propagation.label_propagation_communities.html) and compute the number of communities in the partition and its modularity, coverage, performance, density, and separability. 

In [30]:
task_id = '3b'



In [31]:
from networkx.algorithms.community import label_propagation_communities

def task_3b_solution(Graph):
    # Step 1: Detect communities using label propagation
    lp_communities = list(label_propagation_communities(Graph))

    # Step 2: Count number of communities
    num_community = len(lp_communities)

    # Step 3: Compute partition metrics
    mod = community.modularity(Graph, lp_communities)
    cov, perf = community.partition_quality(Graph, lp_communities)
    sep = avg_measure(Graph, lp_communities, task_2a_solution)
    den = avg_measure(Graph, lp_communities, task_2b_solution)

    return num_community, mod, cov, perf, den, sep, lp_communities



In [32]:
# Use this cell to explore your solution.

task_3b_solution(G)

(28,
 0.5841880852733243,
 0.8088077336197637,
 0.8578876526268868,
 0.6189993234063278,
 1.285532867611396,
 [{0,
   1,
   6,
   11,
   19,
   21,
   24,
   27,
   28,
   29,
   31,
   32,
   35,
   37,
   42,
   43,
   46,
   47,
   48,
   50,
   51,
   53,
   55,
   56,
   57,
   58,
   59,
   60,
   65,
   69,
   70,
   73,
   75,
   76,
   77,
   78,
   80,
   82,
   83,
   85,
   88,
   89,
   96,
   111,
   112,
   114,
   116,
   118,
   127,
   129,
   133,
   137,
   139,
   140,
   141,
   142,
   143,
   145,
   148,
   154,
   155,
   158,
   159,
   167,
   168,
   173,
   177,
   183,
   187,
   188,
   190,
   191,
   192,
   193,
   195,
   202,
   204,
   205,
   213,
   217,
   219,
   225,
   226,
   228,
   234,
   237,
   247,
   251,
   252,
   255,
   257,
   267,
   268,
   269,
   271,
   273,
   274,
   275,
   283,
   284,
   302,
   307,
   313,
   329,
   333,
   362,
   365,
   382,
   393,
   414,
   426,
   427,
   453,
   454,
   456,
   457,
   459,
 

In [33]:
print(f"Task {task_id} - AG tests")
stu_num_community, stu_mod, stu_cov, stu_perf, stu_den, stu_sep, stu_lp_communities = task_3b_solution(G)

print(f"Task {task_id} - your answer:\n{stu_num_community, stu_mod, stu_cov, stu_perf, stu_den, stu_sep, stu_lp_communities}")

assert isinstance(
    stu_num_community, int
), f"Task {task_id}: num_community should be an int. "

assert all(
    isinstance(num, float) for num in [stu_mod, stu_cov, stu_perf, stu_den, stu_sep]
), (
    f"Task {task_id}: mod, cov, perf, den, and sep should be floats. "
)

assert isinstance(
    stu_lp_communities, list
), f"Task {task_id}: lp_communities should be a list. "

#hidden tests for Question 9 are within this cell

Task 3b - AG tests
Task 3b - your answer:
(28, 0.5841880852733243, 0.8088077336197637, 0.8578876526268868, 0.6189993234063278, 1.285532867611396, [{0, 1, 517, 6, 518, 519, 11, 523, 526, 19, 21, 533, 535, 24, 538, 27, 28, 29, 540, 31, 32, 541, 542, 35, 543, 37, 549, 550, 42, 43, 554, 46, 47, 48, 559, 50, 51, 560, 53, 561, 55, 56, 57, 58, 59, 60, 564, 565, 569, 570, 65, 571, 572, 575, 69, 70, 576, 578, 73, 583, 75, 76, 77, 78, 585, 80, 587, 82, 83, 590, 85, 591, 592, 88, 89, 593, 595, 596, 599, 603, 604, 96, 605, 613, 614, 615, 616, 617, 618, 619, 620, 622, 111, 112, 114, 626, 116, 627, 118, 631, 632, 633, 634, 635, 636, 637, 638, 127, 129, 641, 642, 644, 133, 645, 647, 648, 137, 649, 139, 140, 141, 142, 143, 650, 145, 652, 654, 148, 655, 656, 659, 663, 664, 154, 155, 666, 158, 159, 670, 673, 674, 167, 168, 173, 177, 183, 187, 188, 190, 191, 192, 193, 195, 202, 204, 205, 213, 217, 219, 225, 226, 228, 234, 237, 247, 251, 252, 255, 257, 267, 268, 269, 562, 271, 273, 274, 275, 283, 284, 302

The [Clauset-Newman-Moore greedy modularity maximization algorithm](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.modularity_max.greedy_modularity_communities.html#networkx-algorithms-community-modularity-max-greedy-modularity-communities) implements an Agglomerative Hierarchical Clustering procedure to find a partition with high modularity. 

Note: the function `greedy_modularity_communities` returns a list of `frozenset` objects. A `frozenset` is an immutable version of a Python `set`: once created, elements cannot be added or removed.
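
The short sketch below shows what immutability means in practice. The variable names are hypothetical.

```python
# A frozenset supports the usual read-only set operations...
members = frozenset({1, 2, 3})
print(2 in members, len(members))  # True 3
print(members | {4})               # union returns a new frozenset

# ...but has no in-place mutators such as add() or remove().
try:
    members.add(4)
except AttributeError as err:
    print("cannot modify:", err)

# Being hashable, frozensets can be set elements or dict keys; plain sets cannot.
partition_as_set = {frozenset({1, 2}), frozenset({3})}

# Convert to a regular set when a mutable copy is needed.
editable = set(members)
editable.add(4)
```

Membership tests and iteration work the same way on `frozenset` and `set`, so code that only reads a community usually needs no changes.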



### Task 3c: (12 points, Autograded) 

Using Clauset-Newman-Moore greedy modularity maximization, find a partition of the network and compute the number of communities in the partition and its modularity, coverage, performance, density, and separability.

In [34]:
task_id = '3c'


In [35]:
from networkx.algorithms.community import greedy_modularity_communities

def task_3c_solution(Graph):
    # Step 1: Detect communities
    gred_communities = list(greedy_modularity_communities(Graph))

    # Step 2: Count number of communities
    num_community = len(gred_communities)

    # Step 3: Compute metrics
    mod = community.modularity(Graph, gred_communities)
    cov, perf = community.partition_quality(Graph, gred_communities)
    sep = avg_measure(Graph, gred_communities, task_2a_solution)
    den = avg_measure(Graph, gred_communities, task_2b_solution)

    return num_community, mod, cov, perf, den, sep, gred_communities


In [36]:
task_3c_solution(G)

(14,
 0.5503560299288301,
 0.8058922817247199,
 0.7750954874009073,
 0.6085197486447432,
 1.4808561418725692,
 [frozenset({0,
             21,
             38,
             40,
             42,
             50,
             56,
             57,
             58,
             59,
             65,
             69,
             70,
             73,
             74,
             75,
             77,
             78,
             80,
             82,
             90,
             121,
             127,
             147,
             148,
             150,
             162,
             178,
             202,
             270,
             273,
             287,
             302,
             304,
             307,
             313,
             322,
             329,
             336,
             346,
             347,
             348,
             349,
             362,
             382,
             387,
             393,
             408,
             410,
             414,
            

In [37]:
print(f"Task {task_id} - AG tests")
stu_num_community, stu_mod, stu_cov, stu_perf, stu_den, stu_sep, stu_gred_communities = task_3c_solution(G)

print(f"Task {task_id} - your answer:\n{stu_num_community, stu_mod, stu_cov, stu_perf, stu_den, stu_sep, stu_gred_communities}")

assert isinstance(
    stu_num_community, int
), f"Task {task_id}: num_community should be an int. "

assert all(
    isinstance(num, float) for num in [stu_mod, stu_cov, stu_perf, stu_den, stu_sep]
), (
    f"Task {task_id}: mod, cov, perf, den, and sep should be floats. "
)

assert isinstance(
    stu_gred_communities, list
), f"Task {task_id}: gred_communities should be a list. "

#hidden tests for Question 10 are within this cell

Task 3c - AG tests
Task 3c - your answer:
(14, 0.5503560299288301, 0.8058922817247199, 0.7750954874009073, 0.6085197486447432, 1.4808561418725692, [frozenset({0, 512, 513, 514, 515, 516, 517, 518, 519, 520, 522, 523, 526, 529, 530, 531, 532, 21, 533, 534, 535, 536, 539, 540, 541, 542, 543, 544, 549, 38, 550, 40, 552, 551, 42, 554, 555, 558, 559, 561, 562, 50, 563, 566, 567, 56, 57, 58, 568, 569, 570, 571, 59, 572, 573, 574, 575, 576, 65, 577, 579, 578, 582, 581, 69, 70, 583, 74, 584, 73, 585, 75, 586, 591, 77, 590, 78, 80, 593, 82, 594, 595, 90, 598, 599, 600, 601, 602, 603, 607, 604, 605, 613, 614, 615, 616, 617, 618, 619, 620, 621, 624, 626, 627, 629, 631, 632, 121, 634, 635, 636, 637, 638, 127, 639, 640, 641, 642, 644, 643, 645, 647, 648, 649, 650, 652, 654, 655, 656, 657, 658, 659, 148, 147, 662, 661, 150, 663, 664, 665, 666, 670, 671, 673, 674, 162, 676, 178, 202, 270, 273, 287, 302, 304, 307, 313, 322, 329, 336, 346, 347, 348, 349, 362, 382, 387, 393, 589, 408, 410, 414, 422, 423

### Task 3d: (6 points, Autograded) 

Using the [asynchronous Fluid Communities algorithm](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.asyn_fluid.asyn_fluidc.html#networkx.algorithms.community.asyn_fluid.asyn_fluidc) with parameters  `max_iter` = 100 and `seed` = 233 (see Tutorial for details), find the parameter `k` between `k`=3 and `k`=40 that maximizes the modularity of the partition. Note that `k` represents the number of communities in the partition. Also store the corresponding partition in `fluid_communities`.

In [40]:
task_id = '3d'


In [41]:
from networkx.algorithms.community import asyn_fluidc

def task_3d_solution(Graph):
    max_mod = 0
    best_k = 0
    fluid_communities = None

    for k in range(3, 40):
        try:
            # Try to find fluid communities
            communities = list(asyn_fluidc(Graph, k=k, max_iter=100, seed=233))
            mod = community.modularity(Graph, communities)

            if mod > max_mod:
                max_mod = mod
                best_k = k
                fluid_communities = communities
        except Exception:
            # Skip values of k that raise an exception
            continue

    return best_k, fluid_communities


In [42]:
task_3d_solution(G)

(11,
 [{28,
   46,
   74,
   90,
   97,
   106,
   108,
   149,
   164,
   165,
   210,
   218,
   241,
   246,
   248,
   249,
   256,
   259,
   260,
   261,
   263,
   265,
   266,
   277,
   313,
   345,
   361,
   382,
   408,
   451,
   501,
   541,
   592,
   633,
   636,
   669},
  {17,
   18,
   24,
   39,
   53,
   60,
   76,
   83,
   87,
   292,
   293,
   294,
   295,
   298,
   299,
   308,
   314,
   316,
   324,
   331,
   332,
   334,
   339,
   343,
   347,
   349,
   357,
   359,
   360,
   365,
   366,
   367,
   369,
   384,
   387,
   402,
   403,
   412,
   416,
   423,
   432,
   476,
   548,
   610,
   623},
  {9,
   14,
   20,
   36,
   55,
   68,
   123,
   209,
   243,
   258,
   272,
   286,
   289,
   296,
   300,
   309,
   312,
   323,
   328,
   330,
   335,
   338,
   344,
   353,
   374,
   376,
   378,
   379,
   381,
   386,
   392,
   395,
   397,
   404,
   418,
   420,
   429,
   434,
   435,
   437,
   440,
   448,
   450,
   611},
  {2,
   3,
 

In [43]:
print(f"Task {task_id} - AG tests")
stu_best_k, stu_fluid_communities = task_3d_solution(G)

print(f"Task {task_id} - your answer:\n{stu_best_k, stu_fluid_communities}")

assert isinstance(
    stu_best_k, int
), f"Task {task_id}: best_k should be an int. "

assert isinstance(
    stu_fluid_communities, list
), f"Task {task_id}: fluid_communities should be a list. "

#hidden tests for Question 11 are within this cell

Task 3d - AG tests
Task 3d - your answer:
(11, [{256, 259, 260, 261, 263, 265, 266, 149, 277, 408, 28, 541, 669, 164, 165, 46, 313, 451, 74, 592, 210, 345, 218, 90, 97, 361, 106, 108, 241, 633, 501, 246, 248, 249, 636, 382}, {384, 387, 17, 402, 403, 18, 24, 412, 416, 292, 293, 548, 39, 295, 294, 298, 299, 423, 432, 308, 53, 314, 316, 60, 324, 331, 332, 76, 334, 83, 339, 87, 343, 347, 476, 349, 610, 357, 359, 360, 365, 366, 623, 367, 369}, {258, 386, 392, 9, 395, 397, 14, 272, 20, 404, 286, 289, 418, 36, 420, 296, 300, 429, 434, 435, 437, 309, 55, 312, 440, 448, 450, 323, 68, 328, 330, 335, 209, 338, 344, 353, 611, 123, 243, 374, 376, 378, 379, 381}, {2, 3, 4, 7, 521, 10, 12, 13, 16, 26, 546, 547, 553, 41, 556, 557, 44, 559, 567, 571, 66, 580, 588, 84, 606, 609, 105, 109, 625, 628, 631, 122, 128, 130, 644, 648, 653, 660, 668, 157, 672, 673, 163, 212, 233, 285, 287, 288, 291, 301, 307, 318, 321, 325, 340, 342, 350, 352, 354, 355, 358, 363, 364, 370, 371, 372, 373, 388, 399, 407, 415, 417

### Task 3e: (6 points, Autograded) 

Using the [asynchronous Fluid Communities algorithm](https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.asyn_fluid.asyn_fluidc.html#networkx.algorithms.community.asyn_fluid.asyn_fluidc) with parameters  `max_iter` = 100, `seed` = 233, and the value of `k` you found in the previous question, compute the number of communities in the partition and its modularity, coverage, performance, density, and separability.

Hint(s):

- fluid_communities should be converted into a list as in the tutorial.

In [44]:
task_id = '3e'


In [45]:
from networkx.algorithms.community import asyn_fluidc

def task_3e_solution(Graph):
    # Use the best k found previously
    best_k, _ = task_3d_solution(Graph)

    # Get the fluid communities
    fluid_communities = list(asyn_fluidc(Graph, k=best_k, max_iter=100, seed=233))

    # Number of communities
    num_community = len(fluid_communities)

    # Partition quality metrics
    mod = community.modularity(Graph, fluid_communities)
    cov, perf = community.partition_quality(Graph, fluid_communities)
    sep = avg_measure(Graph, fluid_communities, task_2a_solution)
    den = avg_measure(Graph, fluid_communities, task_2b_solution)

    return num_community, mod, cov, perf, den, sep, fluid_communities


In [46]:
# Use this cell to explore your solution.

task_3e_solution(G)

(11,
 0.5908556366160678,
 0.728249194414608,
 0.915127651578055,
 0.2114226288880661,
 1.374655967533157,
 [{28,
   46,
   74,
   90,
   97,
   106,
   108,
   149,
   164,
   165,
   210,
   218,
   241,
   246,
   248,
   249,
   256,
   259,
   260,
   261,
   263,
   265,
   266,
   277,
   313,
   345,
   361,
   382,
   408,
   451,
   501,
   541,
   592,
   633,
   636,
   669},
  {17,
   18,
   24,
   39,
   53,
   60,
   76,
   83,
   87,
   292,
   293,
   294,
   295,
   298,
   299,
   308,
   314,
   316,
   324,
   331,
   332,
   334,
   339,
   343,
   347,
   349,
   357,
   359,
   360,
   365,
   366,
   367,
   369,
   384,
   387,
   402,
   403,
   412,
   416,
   423,
   432,
   476,
   548,
   610,
   623},
  {9,
   14,
   20,
   36,
   55,
   68,
   123,
   209,
   243,
   258,
   272,
   286,
   289,
   296,
   300,
   309,
   312,
   323,
   328,
   330,
   335,
   338,
   344,
   353,
   374,
   376,
   378,
   379,
   381,
   386,
   392,
   395,
   397,


In [47]:
print(f"Task {task_id} - AG tests")
stu_ans = task_3e_solution(G)

print(f"Task {task_id} - your answer:\n{stu_ans}")

stu_num_community, stu_mod, stu_cov, stu_perf, stu_den, stu_sep, stu_fluid_communities = stu_ans

assert isinstance(
    stu_num_community, int
), f"Task {task_id}: num_community should be an int. "

assert all(
    isinstance(num, float) for num in [stu_mod, stu_cov, stu_perf, stu_den, stu_sep]
), (
    f"Task {task_id}: mod, cov, perf, den, and sep should be floats. "
)

assert isinstance(
    stu_fluid_communities, list
), f"Task {task_id}: fluid_communities should be a list. "

#hidden tests for Question 12 are within this cell

Task 3e - AG tests
Task 3e - your answer:
(11, 0.5908556366160678, 0.728249194414608, 0.915127651578055, 0.2114226288880661, 1.374655967533157, [{256, 259, 260, 261, 263, 265, 266, 149, 277, 408, 28, 541, 669, 164, 165, 46, 313, 451, 74, 592, 210, 345, 218, 90, 97, 361, 106, 108, 241, 633, 501, 246, 248, 249, 636, 382}, {384, 387, 17, 402, 403, 18, 24, 412, 416, 292, 293, 548, 39, 295, 294, 298, 299, 423, 432, 308, 53, 314, 316, 60, 324, 331, 332, 76, 334, 83, 339, 87, 343, 347, 476, 349, 610, 357, 359, 360, 365, 366, 623, 367, 369}, {258, 386, 392, 9, 395, 397, 14, 272, 20, 404, 286, 289, 418, 36, 420, 296, 300, 429, 434, 435, 437, 309, 55, 312, 440, 448, 450, 323, 68, 328, 330, 335, 209, 338, 344, 353, 611, 123, 243, 374, 376, 378, 379, 381}, {2, 3, 4, 7, 521, 10, 12, 13, 16, 26, 546, 547, 553, 41, 556, 557, 44, 559, 567, 571, 66, 580, 588, 84, 606, 609, 105, 109, 625, 628, 631, 122, 128, 130, 644, 648, 653, 660, 668, 157, 672, 673, 163, 212, 233, 285, 287, 288, 291, 301, 307, 318, 3

## Part 4: Analysis


Run the code below. It will produce a pandas DataFrame where each row represents a community detection algorithm and each column is a quality measure. This will be relevant for the next question.




In [None]:
# YOUR CODE HERE
raise NotImplementedError()

### Task 4a:

Fill in the values in the `task_4a_solution` function below with strings containing the corresponding method name as it appears in the table above. For example, if the Greedy modularity maximization algorithm has the highest modularity, you should assign "Greedy modularity maximization" to `highest_modularity`.



In [None]:
task_id = '4a'


In [None]:
def task_4a_solution():
    highest_modularity = None
    highest_coverage = None
    highest_performance = None
    highest_density = None
    highest_separability = None

    # YOUR CODE HERE
    raise NotImplementedError()

    return highest_modularity, highest_coverage, highest_performance, highest_density, highest_separability

In [None]:
# Use this cell to explore your solution.

task_4a_solution()

In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_4a_solution()

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert all(
    isinstance(num, str) for num in stu_ans
), (
    f"Task {task_id}: You should return a tuple of strs. "
)

#hidden tests for Question 13 are within this cell

Now, we will examine a network of football games, with undirected edges forming between teams that played each other at least once in 2000.

You may notice each node has a 'value' field. This refers to the teams' conferences.

In [None]:
G_football = nx.read_gml('assets/football.gml', label='id')

max_mod_community_football = None

with open("assets/answer/max_mod_community2", 'rb') as f:
    max_mod_community_football = pickle.load(f)

In [None]:
(list(G_football.nodes(data=True))[0:5])

In [None]:
lp_communities_football = task_3b_solution(G_football)[6]
gred_communities_football = task_3c_solution(G_football)[6]
fluid_communities_football = task_3d_solution(G_football)[1]

compare = None # assign your pandas dataframe here
method_name = ["Girvan–Newman", "Greedy modularity maximization", "Fluid communities", "Label propagation"]
compare = pd.DataFrame(method_name, columns=["Method name"])
methods = [max_mod_community_football, gred_communities_football, fluid_communities_football, lp_communities_football]
compare['num_community'] = [len(med) for med in methods]
compare['modularity'] = [modularity(G_football, med) for med in methods]
compare[['coverage', 'performance']] = [nx.community.partition_quality(G_football, med) for med in methods]
compare['density'] = [avg_measure(G_football, med, task_2b_solution) for med in methods]
compare['separability'] = [avg_measure(G_football, med, task_2a_solution) for med in methods]
compare

### Task 4b:

Fill in the solution strings below as you did in task 4a. Have any answers changed?

In [None]:
task_id = '4b'


In [None]:
def task_4b_solution():
    highest_modularity = None
    highest_coverage = None
    highest_performance = None
    highest_density = None
    highest_separability = None

    # YOUR CODE HERE
    raise NotImplementedError()

    return highest_modularity,highest_coverage, highest_performance, highest_density, highest_separability

In [None]:
# Use this cell to explore your solution.

task_4b_solution()



In [None]:
print(f"Task {task_id} - AG tests")
stu_ans = task_4b_solution()

print(f"Task {task_id} - your answer:\n{stu_ans}")

assert all(
    isinstance(num, str) for num in stu_ans
), (
    f"Task {task_id}: You should return a tuple of strs. "
)

#hidden tests for Question 13 are within this cell

### Task 4c: Community analysis

Using the method best aligned with modularity, which nodes are in the community with 'NewMexico'? The function below should return a list of integers, where each integer is the id of a node representing a school included in the community with 'NewMexico.' 'NewMexico' itself should appear in the list. 

Hint(s):
- 'NewMexico' has a node id of 4.
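
The general pattern of looking up the community that contains a given node can be sketched on toy data. The partition and target id below are hypothetical and unrelated to the football network.

```python
# Hypothetical partition: a list of sets of node ids (not the assignment data).
toy_communities = [{0, 3, 7}, {1, 4, 9}, {2, 5, 6, 8}]
target = 4

# next() returns the first community that contains the target node.
target_community = next(c for c in toy_communities if target in c)

# Sets are unordered, so sort when converting to a list for a stable result.
members = sorted(target_community)
print(members)  # [1, 4, 9]
```

Because a partition places each node in exactly one community, the first match is the only match.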




In [None]:
task_id = '4c'



In [None]:

def task_4c_solution():

    communities = []
    newmexico_community = []


    # YOUR CODE HERE
    raise NotImplementedError()

    return newmexico_community



In [None]:
# Use this cell to explore your solution.

task_4c_solution()

In [None]:
stu_ans = task_4c_solution()

assert isinstance(
    stu_ans, list
), "Your function should return a list. "

assert isinstance(
    stu_ans[0], int
), "Your function should return a list of ints. "


Below, we print out the values and names of the schools we found in the previous question. As you may recall, the 'value' represents the conference the school belongs to. We'd expect schools in the same conference to play each other and therefore be in the same community. Is that what happened here? 7 and 8 correspond to the Mountain West and Pacific Ten conferences, respectively.

In [None]:
def node_data():
    id_to_value = nx.get_node_attributes(G_football, 'value')
    id_to_label = nx.get_node_attributes(G_football, 'label')
    

    schools = task_4c_solution()
    # Pair each school's name (its 'label') with its conference id (its 'value')
    # for every node id returned by task_4c_solution().
    data = [(id_to_label[node_id], id_to_value[node_id]) for node_id in schools]


    return data


node_data()
    