# CSS Lab: Networks

In [None]:
%pylab inline
import json
import math
import networkx as nx
import networkx.algorithms as nxalg
import networkx.algorithms.community as nxcom
import networkx.readwrite as nxrw
import pandas as pd
import visJS2jupyter.visJS_module as vjs

## 1. Build and visualize a network

This section loads network data from a file and explores its basic properties. 

In [None]:
# Helper functions

def load_sample_affiliation(filename="external/hamilton.csv"):
    B = nxrw.adjlist.read_adjlist(filename, delimiter="; ", comments="%")
    return B

def load_sample(filename="external/hamilton.csv"):
    # Load the song-character affiliation network
    B = load_sample_affiliation(filename)
    # Get list of songs from the file
    songs = set()
    with open(filename) as f:
        f.readline()
        for row in f:
            songs.add(row.split("; ")[0])
    # Deduce list of characters
    characters = set(B.nodes()) - songs
    # Project the affiliation network onto the set of characters
    G = nxalg.bipartite.projection.weighted_projected_graph(B, characters)
    return G

def visualize_visjs(
        G, communities=None, colors=None, default_color="#666666",
        node_size_field="node_size", layout="spring", scale=500):
    # Get list of nodes and edges
    nodes = list(G.nodes())
    edges = list(G.edges())
    # Generate a layout for the nodes
    if layout == "circle":
        pos = nx.circular_layout(G, scale=scale)
    else:
        pos = nx.spring_layout(G, k=3/math.sqrt(len(nodes)), scale=scale)
    # If we have communities, assign color based on community
    node_colors = {}
    if colors is None:
        colors = ["#00ff00", "#0000ff", "#00ffff", "#7f7f00", "#ff0000", "#7f7fff"]
    node_community = {}
    if communities is not None:
        for i, com in enumerate(sorted(communities, key=lambda x: len(x), reverse=True)):
            for node in com:
                node_community[node] = i
                try:
                    node_colors[node] = colors[i]
                except IndexError:
                    node_colors[node] = default_color
    # Per-node properties
    nodes_dict = [{
        "id": n,
        "x": pos[n][0],
        "y": pos[n][1],
        "node_size_field": "node_size",
        "node_size": 5,
        "color": node_colors.get(n, default_color)}
        for n in nodes]
    # Map node labels to contiguous ids
    node_map = dict(zip(nodes,range(len(nodes))))
    # Determine edge colors
    edge_colors = {}
    for edge in edges:
        source_color = node_colors.get(edge[0], "#d0d0d0")
        target_color = node_colors.get(edge[1], "#d0d0d0")
        if source_color == target_color:
            edge_colors[edge] = source_color
    # Per-edge properties, use contiguous ids to identify nodes
    edges_dict = [{
        "source": node_map[edges[i][0]],
        "target": node_map[edges[i][1]],
        "title":'test',
        "color": "#d0d0d0"} #edge_colors.get(edge, "#d0d0d0")}
        for i in range(len(edges))]
    return vjs.visjs_network(
        nodes_dict, edges_dict,
        node_size_multiplier=10.0)

### Loading the network

The next cell loads data from a file using the `networkx` library,
and displays a list of nodes in the network.
This example uses characters from the play _Hamilton_.

In [None]:
G = load_sample()
sorted(G.nodes())

Now that you know the labels of the nodes, you can see which nodes are connected by an edge.
In this case, two nodes are connected by an edge if the corresponding characters have parts in the same song.
The next cell chooses a single node and prints a list of all the other nodes it's connected to.
These nodes are called its neighbors.

In [None]:
sorted(G.neighbors('E. Schuyler'))
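
To make the edge rule concrete, here is a minimal pure-Python sketch of projecting a song-to-character affiliation onto the characters. The `cast` dictionary and its song and character names are invented for illustration and are not taken from `hamilton.csv`; the notebook itself relies on `weighted_projected_graph` from `networkx` rather than on this code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each song maps to the characters who sing in it.
cast = {
    "Song A": ["Ann", "Bob", "Cy"],
    "Song B": ["Ann", "Bob"],
    "Song C": ["Cy", "Dee"],
}

# Count, for every pair of characters, how many songs they share.
shared_songs = Counter()
for characters in cast.values():
    for pair in combinations(sorted(characters), 2):
        shared_songs[pair] += 1

# Two characters are neighbors if they share at least one song.
neighbors = {}
for u, v in shared_songs:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

print(sorted(neighbors["Ann"]))
print(shared_songs[("Ann", "Bob")])
```

The pair count plays the role of an edge weight: characters who appear together in more songs get a heavier edge.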

### Visualizing the network
In these visualizations, each circle represents a node.
Edges between two nodes are represented by drawing a line between them.

There are many ways to draw a network.
One simple way is to space all the nodes evenly around a circle.

In [None]:
visualize_visjs(G, layout="circle")
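
As a rough illustration of what a circular layout computes, the sketch below places nodes at evenly spaced angles around a circle. The helper name `circle_positions` is hypothetical; the notebook itself uses `nx.circular_layout`, whose ordering and scaling details may differ.

```python
import math

def circle_positions(nodes, radius=1.0):
    """Place nodes at evenly spaced angles on a circle centred at (0, 0)."""
    step = 2 * math.pi / len(nodes)  # assumes at least one node
    return {
        node: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, node in enumerate(nodes)
    }

pos = circle_positions(["a", "b", "c", "d"], radius=500)
for node, (x, y) in pos.items():
    print(node, round(x), round(y))
```

Because the positions depend only on the order of the nodes, this layout ignores which nodes are connected to each other.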

Another common way to visualize a network is using a "force-directed" layout.
In a force-directed layout, nodes push away from each other, but edges act like springs pulling them back together.
As a result, nodes with many neighbors in common are pulled closer to each other.

In [None]:
visualize_visjs(G, scale=1000)

What do the people in the center of the network have in common? What about the people around the edge?

What are some benefits and drawbacks of the circular layout versus the force-directed layout?

## 2. Centrality measures

One benefit of representing data as a network is that the patterns of connections between nodes can reveal useful information.
Many standard techniques for investigating the structure of networks have been developed.

One of the simplest questions to ask is: which nodes are most important?
But what does "important" mean exactly?
There are several common ways to measure importance, or _centrality_, of nodes in a network.
This section examines several of the most popular.

The next cell creates a data frame with the network nodes and then uses `networkx` to calculate their centralities.

In [None]:
# Convert the node view to a list so pandas receives a plain sequence
node_list = list(G.nodes())
df = pd.DataFrame({"id": node_list, "label": node_list}).set_index("id")
df['degree'] = pd.Series(nx.degree_centrality(G))
df['betweenness'] = pd.Series(nx.betweenness_centrality(G))
df['closeness'] = pd.Series(nx.closeness_centrality(G))
df['eigenvector'] = pd.Series(nx.eigenvector_centrality(G))

### Degree

One very simple way to find important nodes is to count how many neighbors they have.
This measure is called the degree centrality.
This number is typically divided by the total number of other nodes in the network, so a value
of 0.82 means that a node is connected to 82% of the other nodes.
The next cell shows the nodes with the highest degree centralities.

In [None]:
df.sort_values('degree', ascending=False).head(5)
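
The calculation can be checked by hand on a small example. The sketch below uses a made-up four-node network stored as an adjacency dictionary (the node names are illustrative only) and divides each node's neighbor count by the number of other nodes.

```python
# Hypothetical toy network as an undirected adjacency dictionary.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

n = len(adjacency)
# Degree centrality: number of neighbors divided by the number of other nodes.
degree_centrality = {
    node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()
}

for node in sorted(degree_centrality, key=degree_centrality.get, reverse=True):
    print(node, round(degree_centrality[node], 2))
```

Here node `a` is connected to all three other nodes, so its degree centrality is 1.0, while `d` has a single neighbor and scores 1/3.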

### Betweenness

Rather than highly connected nodes, you might want to find nodes that connect different parts of the network.
These types of nodes are sometimes called bridges, or brokers.
The betweenness centrality is based on finding the shortest paths between every pair of nodes.
The nodes along such a path act as bridges connecting its endpoints.
The betweenness of a node is then the fraction of all shortest paths in the network that pass through it.

In [None]:
df.sort_values('betweenness', ascending=False).head(5)
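
The idea can be sketched in pure Python for a small connected network. The toy graph below (two triangles joined through a broker node `x`) and the helper `bfs_counts` are made up for illustration; this brute-force approach is not how `networkx` computes betweenness, and the normalisation shown is the one commonly used for undirected graphs.

```python
from collections import deque

# Hypothetical toy network: two triangles joined through a broker node "x".
adjacency = {
    "a1": {"a2", "a3"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2", "x"},
    "x": {"a3", "b1"},
    "b1": {"x", "b2", "b3"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
}

def bfs_counts(source):
    """Return (distances, shortest-path counts) from source to every node."""
    dist, paths = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                paths[v] += paths[u]
    return dist, paths

# Assumes the network is connected, so every node is reachable from every other.
info = {node: bfs_counts(node) for node in adjacency}
nodes = sorted(adjacency)
n = len(nodes)
betweenness = dict.fromkeys(nodes, 0.0)
for i, s in enumerate(nodes):
    for t in nodes[i + 1:]:
        d_s, p_s = info[s]
        d_t, p_t = info[t]
        for v in nodes:
            if v in (s, t):
                continue
            # v lies on a shortest s-t path when the distances add up exactly.
            if d_s[v] + d_t[v] == d_s[t]:
                betweenness[v] += p_s[v] * p_t[v] / p_s[t]

# Normalise by the number of node pairs that do not include v.
pairs = (n - 1) * (n - 2) / 2
betweenness = {v: b / pairs for v, b in betweenness.items()}
print(max(betweenness, key=betweenness.get))
```

The broker `x` has only two neighbors, yet every shortest path between the two triangles runs through it, so it ends up with the highest betweenness.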

### Closeness

Another way to think about importance is how quickly a node can reach everyone else.
The closeness centrality is based on the lengths of the shortest paths from a node to all other nodes.
It is the reciprocal of the average shortest-path distance, so a value
of 0.5 means that a node is, on average, two steps away from the other nodes.
The next cell shows the nodes with the highest closeness centralities.

In [None]:
df.sort_values('closeness', ascending=False).head(5)
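
Closeness centrality is commonly defined as the number of other nodes divided by the sum of shortest-path distances to them. The sketch below works this out with a breadth-first search on a made-up five-node path; the graph and the helper `distances_from` are illustrative assumptions, and the formula assumes a connected network.

```python
from collections import deque

# Hypothetical toy network: a path a - b - c - d - e.
adjacency = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"},
}

def distances_from(source):
    """Breadth-first search returning the hop distance to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

n = len(adjacency)
closeness = {}
for node in adjacency:
    total = sum(distances_from(node).values())
    # Reciprocal of the average distance to the other n - 1 nodes.
    closeness[node] = (n - 1) / total

for node in sorted(closeness, key=closeness.get, reverse=True):
    print(node, round(closeness[node], 3))
```

The middle node `c` is at most two steps from everyone and scores highest, while the endpoints `a` and `e` score lowest.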

### Eigenvector

A node can also be considered important if it is connected to other important nodes.
This measure is called the eigenvector centrality.
Each node's score is proportional to the sum of its neighbors' scores, so a node with
a few well-connected neighbors can outrank a node with many poorly connected ones.
The next cell shows the nodes with the highest eigenvector centralities.

In [None]:
df.sort_values('eigenvector', ascending=False).head(5)
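
Eigenvector centrality scores can be approximated by repeatedly replacing each node's score with the sum of its neighbors' scores and rescaling. The sketch below does this on a made-up five-node network; the graph and the fixed iteration count are illustrative assumptions, and `networkx` may use a different iteration scheme and stopping rule.

```python
import math

# Hypothetical toy network: a triangle (a, b, c) with a tail c - d - e.
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

# Start with equal scores, then repeatedly sum the neighbors' scores.
scores = dict.fromkeys(adjacency, 1.0)
for _ in range(200):
    new = {
        node: sum(scores[nbr] for nbr in adjacency[node])
        for node in adjacency
    }
    # Rescale so the scores keep unit Euclidean length.
    norm = math.sqrt(sum(value * value for value in new.values()))
    scores = {node: value / norm for node, value in new.items()}

for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Node `d` has the same number of neighbors as `a` and `b`, but it scores lower because one of its neighbors is the poorly connected node `e`.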

## SCRATCH SPACE BELOW HERE

## Build and visualize a network

In [None]:
def load_lost_circles_json(in_file):
    with open(in_file) as f:
        raw = json.load(f)
    id_to_name = dict((i, datum["name"]) for i, datum in enumerate(raw['nodes']))
    edges = [(datum["source"], datum["target"]) for datum in raw['links']]
    return id_to_name, edges

In [None]:
in_file = "external/LostCircles/sample.json"
id_to_name, edges = load_lost_circles_json(in_file)

In [None]:
G = nx.Graph()
G.add_edges_from(edges)
giant_component = max(list(nxalg.connected_components(G)), key=len)
for node in (set(G.nodes()) - set(giant_component)):
    G.remove_node(node)

In [None]:
def visualize_nx(G, communities=None, colors=None, default_color="#666666", node_size_field="node_size"):
    # Get list of nodes and edges
    nodes = list(G.nodes())
    edges = list(G.edges())
    # Generate a layout for the nodes
    pos = nx.spring_layout(G, k=1/math.sqrt(len(nodes)), scale=1)
    # If we have communities, assign color based on community
    node_colors = {}
    if colors is None:
        colors = ["#00ff00", "#0000ff", "#00ffff", "#7f7f00", "#ff0000", "#7f7fff"]
    node_community = {}
    if False:
        if communities is not None:
            for i, com in enumerate(sorted(communities, key=lambda x: len(x), reverse=True)):
                for node in com:
                    node_community[node] = i
                    try:
                        node_colors[node] = colors[i]
                    except IndexError:
                        node_colors[node] = default_color
        # Per-node properties
        nodes_dict = [{
            "id": id_to_name[n],
            "x": pos[n][0],
            "y": pos[n][1],
            "node_size_field": "node_size",
            "node_size": 5,
            "color": node_colors.get(n, "#666666")}
            for n in nodes]
        # Map node labels to contiguous ids
        node_map = dict(zip(nodes,range(len(nodes))))
        # Determine edge colors
        edge_colors = {}
        for edge in edges:
            source_color = node_colors.get(edge[0], "#d0d0d0")
            target_color = node_colors.get(edge[1], "#d0d0d0")
            if source_color == target_color:
                edge_colors[edge] = source_color
        # Per-edge properties, use contiguous ids to identify nodes
        edges_dict = [{
            "source": node_map[edges[i][0]],
            "target": node_map[edges[i][1]],
            "title":'test',
            "color": "#d0d0d0"} #edge_colors.get(edge, "#d0d0d0")}
            for i in range(len(edges))]
    figure(figsize=(128,128))
    nx.draw(G, pos, node_size=500, edge_color="#d0d0d0")

In [None]:
visualize_nx(G)

In [None]:
visualize_visjs(G)

## Calculate centralities

In [None]:
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
closeness = nx.closeness_centrality(G)

In [None]:
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)
closeness = nx.closeness_centrality(G)
nodes = list(G.nodes())
df = pd.DataFrame({
    "id": nodes,
    "label": [id_to_name[n] for n in nodes],
    "degree": [degree[n] for n in nodes],
    "betweenness": [betweenness[n] for n in nodes],
    "eigenvector": [eigenvector[n] for n in nodes],
    "closeness": [closeness[n] for n in nodes]}).set_index("id")

In [None]:
df.sort_values("degree", ascending=False).head()

In [None]:
df.sort_values("betweenness", ascending=False).head()

In [None]:
df.sort_values("eigenvector", ascending=False).head()

In [None]:
df.sort_values("closeness", ascending=False).head()

In [None]:
measures = ["degree", "betweenness", "eigenvector", "closeness"]
plt.figure(figsize=(8,8))
for row in range(4):
    for col in range(4):
        plt.subplot(4,4,1 + row*4 + col)
        if row != col:
            plt.plot(df[measures[row]], df[measures[col]], '.', markersize=1)
        if row == 3:
            plt.xlabel(measures[col])
        if col == 0:
            plt.ylabel(measures[row])
plt.tight_layout()

## Find communities

In [None]:
communities = []
# Girvan Newman takes about 4 hours to run on ~1000 node social net
for i, com in enumerate(nxcom.girvan_newman(G)):
    print("Found {}:".format(i))
    most_recent = com
    communities.append(com)

In [None]:
list_communities = []
for com in communities:
    list_communities.append([list(x) for x in com])
with open("communities.json","wb") as f:
    f.write(json.dumps(list_communities).encode('utf-8'))

In [None]:
with open("communities.json") as f:
    communities = json.loads(f.read())
for i, com in enumerate(communities):
    communities[i] = sorted(communities[i], key=lambda x: len(x), reverse=True)

In [None]:
visualize_visjs(G, communities[5])
