# CNA Homework

## Part 1

**Reddit Networks**

Using the [Reddit networks dataset](http://dynamics.cs.washington.edu/nobackup/reddit/), select the subreddit of your favorite TV show (or choose any other dataset that contains network data). Then complete the tasks below.

In [None]:
#!pip install turicreate

In [1]:
!wget http://dynamics.cs.washington.edu/nobackup/reddit/theoffice.tar.gz

--2022-01-07 12:16:36--  http://dynamics.cs.washington.edu/nobackup/reddit/theoffice.tar.gz
Resolving dynamics.cs.washington.edu (dynamics.cs.washington.edu)... 128.208.3.120, 2607:4000:200:12::78
Connecting to dynamics.cs.washington.edu (dynamics.cs.washington.edu)|128.208.3.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3550974 (3.4M) [application/x-gzip]
Saving to: ‘theoffice.tar.gz’


2022-01-07 12:16:37 (2.98 MB/s) - ‘theoffice.tar.gz’ saved [3550974/3550974]



In [2]:
import tarfile

fname = 'theoffice.tar.gz'
# Extract the archive into ./data/; the context manager closes the file even on error.
with tarfile.open(fname, "r:gz") as tar:
    tar.extractall(path="./data/")

In [3]:
import os
import turicreate as tc 
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

g = tc.SGraph()
graphs_dir = 'data'
# Merge every .sgraph folder found in graphs_dir into a single SGraph.
sframes_paths = [os.path.join(graphs_dir, s) for s in os.listdir(graphs_dir)]
for folder in tqdm(sframes_paths):
    if not folder.endswith(".sgraph"):
        continue
    subG = tc.load_sgraph(folder)
    g = g.add_vertices(subG.get_vertices())
    g = g.add_edges(subG.get_edges())


In [4]:
# Convert the SGraph to a networkx graph, if needed

import networkx as nx

def sgraph2nxgraph(sgraph, is_directed=True, add_vertices_attributes=True, add_edges_attributes=True):
    if is_directed:
        nx_g = nx.DiGraph()
    else:
        nx_g = nx.Graph()
    if add_vertices_attributes:
        vertices = [(r['__id'] , r) for r in sgraph.vertices]
    else:
        vertices = list(sgraph.get_vertices()['__id'])

    if add_edges_attributes:
        edges = [(r['__src_id'], r['__dst_id'], r) for r in sgraph.edges]
    else:
        edges = [(e['__src_id'], e['__dst_id']) for e in sgraph.get_edges()]
    nx_g.add_nodes_from(vertices)
    nx_g.add_edges_from(edges)
    return nx_g

**Task 1 (_max score - 10 points_)**: Calculate and visualize the degree distribution of the vertices in the network.
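
As a starting point, the degree distribution can be sketched on a small toy graph. This is only an illustration under assumptions: the graph `toy` and its edges are made up, and the total degree (in plus out) is used, which is one of several reasonable choices for a directed network.

```python
from collections import Counter

import networkx as nx

# Toy directed graph standing in for the Reddit network (hypothetical data).
toy = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")])

# Total degree (in-degree + out-degree) of every vertex.
degrees = dict(toy.degree())

# Map each degree value to the number of vertices that have it.
degree_counts = Counter(degrees.values())

for degree, count in sorted(degree_counts.items()):
    print(f"degree {degree}: {count} vertices")
```

The resulting counts can be passed to a bar plot, for example `plt.bar(list(degree_counts.keys()), list(degree_counts.values()))`; for large networks a log-log scale is often easier to read.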

**Task 2 (_max score - 15 points_)**: Create a subgraph of the top-20 users according to the PageRank algorithm. Draw the subgraph.
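
A possible shape for this task, sketched on a toy graph. The graph, the node names, and `k = 3` (standing in for 20) are hypothetical; `nx.pagerank` is used with its default parameters.

```python
import networkx as nx

# Toy directed graph (hypothetical data): every user links to "hub",
# and "hub" links back to "user0".
toy = nx.DiGraph()
toy.add_edges_from((f"user{i}", "hub") for i in range(5))
toy.add_edge("hub", "user0")

k = 3  # the task asks for 20; 3 keeps the toy example readable

# PageRank score for every vertex, as a dict keyed by vertex.
scores = nx.pagerank(toy)

# The k vertices with the highest scores, and the subgraph they induce.
top_k = sorted(scores, key=scores.get, reverse=True)[:k]
sub = toy.subgraph(top_k)

print(top_k)
print(sub.number_of_nodes(), sub.number_of_edges())
```

The induced subgraph can then be drawn with `nx.draw(sub, with_labels=True)`.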

**Task 3 (_max score - 15 points_)**: Visualize the distribution of the network's strongly and weakly connected components.

* Since we did not get a chance to cover this in class, you can read a short explanation of these terms [here](https://www.geeksforgeeks.org/check-if-a-graph-is-strongly-unilaterally-or-weakly-connected/)
* This might be helpful: [networkx.weakly_connected_components](https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.components.weakly_connected_components.html?highlight=weakly_connected_components#networkx.algorithms.components.weakly_connected_components) 
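
To make the two notions concrete, here is a minimal sketch on a made-up directed graph. It collects the component sizes, which is the quantity whose distribution the task asks you to visualize; the graph itself is a hypothetical example.

```python
import networkx as nx

# Toy directed graph (hypothetical data): a 3-cycle 1 -> 2 -> 3 -> 1,
# a tail 3 -> 4, and a separate edge 5 -> 6.
toy = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4), (5, 6)])

# Strongly connected: every vertex reaches every other along edge directions.
scc_sizes = sorted(
    (len(c) for c in nx.strongly_connected_components(toy)), reverse=True
)

# Weakly connected: connected once edge directions are ignored.
wcc_sizes = sorted(
    (len(c) for c in nx.weakly_connected_components(toy)), reverse=True
)

print(scc_sizes)  # [3, 1, 1, 1]
print(wcc_sizes)  # [4, 2]
```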

**Task 4 (_max score - 10 points_)**: Split the network into communities, and find the __second__ most central vertex in each community (use degree_centrality).
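
One way to structure this, sketched on the karate club graph that ships with networkx. Two choices here are assumptions, not requirements of the task: communities come from `greedy_modularity_communities`, and degree centrality is computed on the whole graph rather than inside each community. Ties are broken by sort order.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Small built-in example graph standing in for the Reddit network.
toy = nx.karate_club_graph()

# Each community is a frozenset of vertices.
communities = greedy_modularity_communities(toy)

# Degree centrality over the whole graph (an assumption; see the text above).
centrality = nx.degree_centrality(toy)

second_most_central = {}
for idx, members in enumerate(communities):
    # Rank the community's vertices from most to least central.
    ranked = sorted(members, key=centrality.get, reverse=True)
    if len(ranked) >= 2:  # a single-vertex community has no second vertex
        second_most_central[idx] = ranked[1]

print(second_most_central)
```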

## Part 2

In [None]:
import networkx as nx
from networkx.algorithms.centrality import *
from scipy.stats import spearmanr 
from networkx.generators.geometric import random_geometric_graph
from networkx.algorithms.community import * 

### Let's generate some networks

Every network contains four sets (communities) of 100 nodes each. $a\in \{0.1,0.01\}$ is the probability of an edge between two nodes in the same set, and $b\in \{0.1,0.01\}$ is the probability of an edge between two nodes in different sets.
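
To see what $a$ and $b$ control, the sketch below builds a single network with $a = 0.1$ and $b = 0.01$ and compares the observed edge density within sets to the density between sets. The fixed `seed=0` and the variable names are assumptions made for the illustration.

```python
import networkx as nx

a, b = 0.1, 0.01
sizes = [100, 100, 100, 100]

# 4x4 probability matrix: a on the diagonal (same set), b elsewhere.
p = [[a if i == j else b for j in range(4)] for i in range(4)]
sbm = nx.stochastic_block_model(sizes, p, seed=0)

# stochastic_block_model stores each node's set index in the "block" attribute.
block = nx.get_node_attributes(sbm, "block")
within = sum(1 for u, v in sbm.edges() if block[u] == block[v])
between = sbm.number_of_edges() - within

# Number of possible node pairs inside sets and across sets.
within_pairs = sum(n * (n - 1) // 2 for n in sizes)
total_pairs = sum(sizes) * (sum(sizes) - 1) // 2
between_pairs = total_pairs - within_pairs

print("density within sets: ", within / within_pairs)
print("density between sets:", between / between_pairs)
```

The two densities should land close to $a$ and $b$ respectively, up to random fluctuation.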

In [None]:
blocks = [100,100,100,100]
probs = [
        [[a,b,b,b],
         [b,a,b,b],
         [b,b,a,b],
         [b,b,b,a]]
    for a,b in [(0.01,0.01),(0.1,0.01),(0.01,0.1)]
    ]

In [None]:
nets = [nx.generators.community.stochastic_block_model(blocks,p) for p in probs*100]
print("There are {} networks in total.".format(len(nets)))

There are 300 networks in total.


### Node centrality **(_max score - 10 points_)**
Your code goes here: replace each `[]` according to the comment on the same line.



In [None]:
centralities =  [ {
                    'degree':[], #replace [] with a sequence of node degree centralities 
                    'closeness':[], #replace [] with a sequence of node closeness centralities  
                    'betweenness': [] #replace [] with a sequence of node betweenness centralities
                  } 
                for G in nets
                ]
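
The networkx centrality functions return a dict keyed by node, while `spearmanr` expects sequences whose positions refer to the same nodes. A minimal sketch of that conversion on a toy path graph follows; the graph and the variable names are hypothetical and this is not the only way to fill in the template above.

```python
import networkx as nx

# Toy graph: a path 0 - 1 - 2 - 3 - 4.
toy = nx.path_graph(5)

# Fix one node order so the three sequences are aligned position by position.
nodes = list(toy.nodes())

degree = nx.degree_centrality(toy)
closeness = nx.closeness_centrality(toy)
betweenness = nx.betweenness_centrality(toy)

degree_seq = [degree[v] for v in nodes]
closeness_seq = [closeness[v] for v in nodes]
betweenness_seq = [betweenness[v] for v in nodes]

print(degree_seq)
print(closeness_seq)
print(betweenness_seq)
```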

In [None]:
# Here we compute the correlations between the three centrality measures for each network.
# Every network is characterized by its triplet of centrality correlations.
centrality_correlations = [
    (
        spearmanr(c['degree'],c['closeness'])[0],
        spearmanr(c['degree'],c['betweenness'])[0],
        spearmanr(c['betweenness'],c['closeness'])[0],
    )
    for c in centralities
]

### Build a meta-network

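In the meta-network, each network is a node, and two nodes are connected by an edge if their centrality correlations are similar, that is, if their correlation triplets lie within Euclidean distance `radius` of each other.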
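
The sketch below shows how `random_geometric_graph` behaves when positions are supplied explicitly: no random placement happens, and an edge appears only between points that are at most `radius` apart. The three points are made-up stand-ins for correlation triplets.

```python
from networkx.generators.geometric import random_geometric_graph

# Hypothetical correlation triplets: the first two are close, the third is far.
points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.5, 0.5, 0.5)]

meta = random_geometric_graph(
    n=len(points),
    radius=0.025,
    dim=3,
    pos=dict(enumerate(points)),  # node i is placed at points[i]
)

# Only nodes 0 and 1 are within the radius of each other.
print(list(meta.edges()))
```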

In [None]:
radius = 0.025
G = random_geometric_graph(n=len(nets), radius=radius, dim=3, pos=dict(enumerate(centrality_correlations)))

In [None]:
print("number of nodes in G is the same as the number of networks in nets:{}".format(G.number_of_nodes()))
print("number of edges in G is:{}".format(G.number_of_edges()))

number of nodes in G is the same as the number of networks in nets:300
number of edges in G is:0


In [None]:
pos = nx.spring_layout(G)
nx.draw(G, pos=pos,node_size=5,alpha=0.2)

### Communities

#### What is the number of communities in nets[1]? **(_max score - 10 points_)**
- Use _greedy_modularity_communities_

In [None]:
#find the community structure of nets[1] and print the number of communities in nets[1]

#### How could you know it without running community detection? **(_max score - 10 points_)**

#### What is the number of communities in the meta-network G? **(_max score - 10 points_)**

In [None]:
# Find the number of communities in G

#### Can you explain why this is the number of communities in G? **(_max score - 10 points_)**