## Required packages

See the README for instructions on how to set up and activate a minimal conda environment that includes the packages necessary to run this notebook. 

In [None]:
## path to the datasets
datadir='../Datasets/'

## required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import igraph as ig
import partition_igraph
from collections import Counter

# Part 1 - Basic Concepts and Exploratory Data Analysis (EDA)

igraph is a powerful, scalable library for graph analysis (its core is written in C) with R, Python, and Mathematica interfaces. This section introduces how to work with graphs in igraph.

**Note:** While we will also use igraph for visualization today, it is NOT the best tool for graph visualization; more specialized tools exist for this, such as Graphviz, Bokeh, etc.

## 1.1 Relational data as graphs

We will use the open source US airport graph as our main dataset:

* 2008 air traffic in the USA
* 464 nodes representing airports
* 12,000 edges representing travel from airport A to airport B, weighted by number of passengers 
* Some extra node attributes (city, state, lat/lon)

First, let's load the nodes (vertices), in this case the list of airports.

Those can be represented in a simple **data frame** as a list of objects with some features.

In [None]:
airport_df = pd.read_csv(datadir + 'Airports/airports_loc.csv')
airport_df.head()

What makes the data **relational** is that we also have **pairwise relations** between the airports, namely the **volume of passengers**. 

Those are the **edges** in our graph/network. 

Some initial questions to ask when looking at (pairwise) relational data:
* is there a sense of **direction** to the relationship? (directed vs undirected graphs)
    * ex: direction of travel from airport A to B
    * ex: I follow you vs. we are friends

* are some ties **stronger** than others? This is usually modelled in one of two ways:
  * **weights**: higher weights can mean stronger ties (ex: number of passengers; number of common friends);
  * **distance**: smaller distance can mean stronger ties (ex: commute time between cities; resistance in an electric circuit)

* can there be a relationship with oneself? This is modelled by **loops** 
  * ex: flight back to the same airport; 

* are there other **attributes**? This can either be of the nodes or the edges.
  * ex: city, state as node attributes
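
The questions above can be checked directly on an edge list before any graph library is involved. Here is a minimal sketch in plain Python; the node names and passenger counts are made up for illustration and do not come from the airport dataset.

```python
# Toy edge list (hypothetical values): (source, target, passengers)
edges = [
    ("A", "B", 120),
    ("B", "A", 80),
    ("A", "C", 40),
    ("C", "C", 5),   # a loop: a flight returning to the same airport
]

# Direction matters if some edge has no edge in the reverse direction
pairs = {(s, t) for s, t, _ in edges}
one_way = sorted((s, t) for s, t in pairs if s != t and (t, s) not in pairs)

# Loops are edges whose source and target coincide
loops = [(s, t) for s, t, _ in edges if s == t]

# Node attributes live in a separate table keyed by node name
node_attrs = {"A": {"state": "CA"}, "B": {"state": "NY"}, "C": {"state": "TX"}}

print(one_way)  # [('A', 'C')]
print(loops)    # [('C', 'C')]
```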
  
The edges for the airport graph are the connections between the airports. Let's start by loading them into a pandas dataframe. 

In [None]:
df_edges = pd.read_csv(datadir + 'Airports/connections.csv')
df_edges.head() ## look at a few edges

### Build a weighted directed graph from the edges

One way to create a graph in igraph is by handing it a list of edges in tuples of the form `(source, target, weight)`. 

We'll convert the above dataframe to a list of tuples and create a directed, weighted graph from it. 

In [None]:
tuple_list = [tuple(x) for x in df_edges.values]
g = ig.Graph.TupleList(tuple_list, directed=True, edge_attrs=['weight'])

### Graph Objects in igraph

A graph in igraph consists of:
* a **vertex** sequence object, with 0-based indices
* an **edge** sequence object, each edge connecting a pair of vertices

Note: node and vertex mean the same thing. igraph uses the language of vertices and we will use node and vertex interchangeably.

**WARNING**: node names, if stored, can also be integers and may not correspond to node indices.
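
To see why names and indices can disagree, here is a small plain-Python sketch of turning named edges into 0-based indices. It assumes indices are handed out in order of first appearance; that is an assumption made for illustration, and the integer names are hypothetical.

```python
# Hypothetical edges whose node names are integers
edges = [(10, 20, 5.0), (20, 30, 2.0), (30, 10, 1.0)]

name_to_index = {}
for source, target, _ in edges:
    for name in (source, target):
        # setdefault assigns the next free index the first time a name is seen
        name_to_index.setdefault(name, len(name_to_index))

indexed_edges = [(name_to_index[s], name_to_index[t], w) for s, t, w in edges]

print(name_to_index)  # {10: 0, 20: 1, 30: 2}
print(indexed_edges)  # [(0, 1, 5.0), (1, 2, 2.0), (2, 0, 1.0)]
```

The node named `10` ends up at index `0`, so looking up "vertex 10" by index would return a different node (or fail).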


### Vertices

Vertices can be accessed by index (NOT the same thing as accessing the vertex by the name)

In [None]:
g.vs[0]

It's useful to be able to `.find()` a vertex given its name (be careful if names are also integers!)

In [None]:
g.vs.find('LAX')

In [None]:
g.vs.find('LAX').index

`.find()` returns the first vertex given some conditions

`.select()` returns all vertices given some condition

```
Keyword arguments can be used to filter the vertices based on their attributes. The name of the keyword specifies 
the name of the attribute and the filtering operator, they should be concatenated by an underscore character. 
Possible operators are:

    eq: equal to
    ne: not equal to
    lt: less than
    gt: greater than
    le: less than or equal to
    ge: greater than or equal to
    in: checks if the value of an attribute is in a given list
    notin: checks if the value of an attribute is not in a given list
```
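
The `attribute_operator` keyword convention described above can be imitated in a few lines of plain Python. This is only a sketch of the idea on a list of dictionaries, not igraph's implementation; the function `select` and the sample records are hypothetical.

```python
import operator

# Operators from the list above, mapped to plain Python functions
OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "gt": operator.gt,
    "le": operator.le, "ge": operator.ge,
    "in": lambda value, options: value in options,
    "notin": lambda value, options: value not in options,
}

def select(records, **conditions):
    """Return the records (dicts) that satisfy every attr_op=value condition."""
    selected = list(records)
    for key, expected in conditions.items():
        attr, _, op = key.rpartition("_")
        if not attr or op not in OPS:
            # no recognised operator suffix: plain equality on the whole key
            attr, op = key, "eq"
        selected = [r for r in selected if OPS[op](r[attr], expected)]
    return selected

airports = [{"name": "LAX"}, {"name": "SFO"}, {"name": "YUM"}]
print([a["name"] for a in select(airports, name_ge="Y")])             # ['YUM']
print([a["name"] for a in select(airports, name_in=["LAX", "SFO"])])  # ['LAX', 'SFO']
print(len(select(airports, name="abc")))                              # 0
```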


For example, we can use `.select()` to see that there is no airport named `abc`.

In [None]:
len(g.vs.select(name = 'abc'))

There are, however, multiple airports whose names start with `Y` or `Z`:

In [None]:
len(g.vs.select(name_ge = 'Y'))

In [None]:
[v['name'] for v in g.vs.select(name_ge = 'Y')]

The vertex sequence is a python iterable, so things like list comprehension work on it:

In [None]:
[v for v in g.vs[:5]]

Any vertex attribute may be added. In this case, the vertex set is used as a dictionary where the keys are the
attribute names. The values corresponding to the keys are the values of the given attribute for every vertex selected by the sequence. 

In [None]:
g.vs[0]['color'] = 'green'  ## a single vertex takes a single value, not a list
[v for v in g.vs][:5]

If you specify a sequence that is shorter than the number of vertices in
vertex sequence, the sequence is reused:

In [None]:
g.vs['color'] = ['lightblue', 'pink', 'purple']
[v for v in g.vs][:5]
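
The recycling behaviour described above is the same as cycling through a short list until every vertex has a value. A plain-Python sketch of that idea, with a hypothetical vertex count:

```python
from itertools import cycle, islice

colours = ["lightblue", "pink", "purple"]
n_vertices = 5  # hypothetical vertex count

# Repeat the short list until every vertex has a value
assigned = list(islice(cycle(colours), n_vertices))
print(assigned)  # ['lightblue', 'pink', 'purple', 'lightblue', 'pink']
```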

For later visualization, let's reset all of the vertices to `lightblue`.

In [None]:
g.vs['color'] = 'lightblue'

It's easy to access the number of vertices, in this case, the number of airports. 

In [None]:
g.vcount()

### Edges

Edges are accessed by index in the edge sequence

In [None]:
g.es[0]

Each edge has a tuple of vertex indices representing `(source, target)`

In [None]:
g.es[0].tuple

In [None]:
g.es[0].source, g.es[0].target

Let's look up the details of the edge. 

In [None]:
e_idx = 0

source_idx = g.es[e_idx].tuple[0]
target_idx = g.es[e_idx].tuple[1]

source_name = g.vs[source_idx]['name']
target_name = g.vs[target_idx]['name']

edge_weight = g.es[e_idx]['weight']

print(source_name, '--->' ,target_name,'has weight',edge_weight)

In [None]:
# package this in a function
def print_edge_details(g, e_idx):
    """Helper function taking a graph and edge index to display edge information"""
    print(g.vs[g.es[e_idx].tuple[0]]['name'], '--->',
          g.vs[g.es[e_idx].tuple[1]]['name'], 'has weight',g.es[e_idx]['weight'])

In [None]:
print_edge_details(g, 0)

Is there an edge in the other direction? We can check by looking up an edge by vertex ids.

In [None]:
g.are_connected(target_idx, source_idx)

Yes, there's an edge. Let's check its index in the edge sequence. 

In [None]:
rev_e_idx = g.get_eid(target_idx, source_idx)
rev_e_idx

In [None]:
print_edge_details(g, rev_e_idx)

Some routes are only one-way

In [None]:
g.vs.find('BMI').index

Let's check BMI -> SFO (100 to 0)

In [None]:
g.are_connected(100, 0)

It's connected, so let's get an edge from BMI to SFO

In [None]:
g.es[g.get_eid(100,0)]

What about in the other direction, SFO -> BMI?

In [None]:
g.are_connected(0, 100)

What if we want to find all routes originating from SFO? We can do this by asking which edges are **incident** to this vertex.

In [None]:
len(g.incident(0, mode='out'))

There are a lot. Let's see what a few of them are. 

In [None]:
n = 5
for e_idx in  g.incident(0, mode='out')[:n]:
    print_edge_details(g, e_idx)

Number of edges

In [None]:
g.ecount()

### Attributes

A common **edge attribute** is the edge weight, or distance. 

For the airport dataset, we also have a few **node attributes** that we already saw:
 * City and state
 * Latitude and longitude (useful for nice layout)

Let's add them to the graph.

In [None]:
airport_df.head()

We have attributes by airport code, which are our vertex names. 

We need to look up the vertex indices to add the attributes. 

In [None]:
for index, row in airport_df.iterrows():
    v = g.vs.find(row['airport'])
    v['state'] = row['state']
    v['city'] = row['city']
    v['layout'] = (row['lon'],-row['lat'])    

A faster method is to build a lookup dictionary that translates between the dataframe row indices and the vertex sequence indices of our vertices.

In [None]:
lookup = {k:v for v,k in enumerate(airport_df['airport'])}
l = [lookup[x] for x in g.vs()['name']] ## order of nodes in our graph

In [None]:
## sanity check
v = 0
print('vertex',v,':',g.vs[v]['name'], 'is at row', l[v])
print(airport_df.loc[l[v]])

Let's add the attributes to the graph.

Note: we use layout = (longitude, -latitude) because the plotting origin is in the top-left corner, with y increasing downward.

In [None]:
g.vs['layout'] = [(airport_df['lon'][i],-airport_df['lat'][i]) for i in l]
g.vs['state'] = [airport_df['state'][i] for i in l]
g.vs['city'] = [airport_df['city'][i] for i in l]
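
The sign flip on latitude can be checked with a tiny plain-Python sketch. It assumes a drawing surface whose origin is the top-left corner, and the coordinates below are rough values for illustration only, not taken from the dataset.

```python
# Approximate coordinates, for illustration only: (latitude, longitude)
airports = {
    "SEA": (47.45, -122.31),
    "SFO": (37.62, -122.38),
    "LAX": (33.94, -118.41),
}

# Screen-style coordinates: x grows to the right, y grows downward
layout = {code: (lon, -lat) for code, (lat, lon) in airports.items()}

# Sorting by y lists airports from the top of the picture to the bottom
top_to_bottom = sorted(layout, key=lambda code: layout[code][1])
print(top_to_bottom)  # ['SEA', 'SFO', 'LAX']
```

With `-lat` as the y coordinate, the northernmost airport is drawn at the top, as on a conventional map.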

Let's look at one vertex now, to see the new attributes.

In [None]:
g.vs[0]

### Subgraphs and different types of graphs
To make our next analysis easier, we'll work off of a small subgraph.

In [None]:
subgraph_nodes = [g.vs.find(name='LAX').index,
                  g.vs.find(name='SFO').index,
                  g.vs.find(name='OAK').index]
subgraph_nodes

It's easy to get an induced subgraph from a list of nodes

In [None]:
sg = g.subgraph(subgraph_nodes)

In [None]:
ig.plot(sg, bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50)

What if we want to know how many people travelled between the airports? In this case we don't care about the direction. We can do this by creating an undirected weighted graph, where the new weights are the sums of the edge weights in both directions.

In [None]:
sg_und = sg.as_undirected(combine_edges=sum)
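
What combining edges by `sum` amounts to can be sketched on a plain edge list: both directions of a route share one undirected key, and their weights are added. The passenger counts below are hypothetical.

```python
from collections import defaultdict

# Hypothetical directed edges: (source, target, passengers)
directed = [("LAX", "SFO", 100), ("SFO", "LAX", 90), ("SFO", "OAK", 10)]

undirected = defaultdict(int)
for s, t, w in directed:
    key = tuple(sorted((s, t)))  # (A, B) and (B, A) share one key
    undirected[key] += w

print(dict(undirected))  # {('LAX', 'SFO'): 190, ('OAK', 'SFO'): 10}
```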

In [None]:
ig.plot(sg_und, bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50)


If we want to get rid of the loops, we can use `simplify`.

In [None]:
sg = sg.simplify(combine_edges=sum)

In [None]:
ig.plot(sg, bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50)

Combining the above, we get a simple, undirected graph. 

In [None]:
ig.plot(sg.as_undirected(combine_edges=sum),bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50)


We've been carrying the weights through, so let's visualize them by setting an edge width attribute. Here we get a weighted digraph.

In [None]:
sg.es['width'] = [int(np.log10(x)+1) for x in sg.es['weight']]
ig.plot(sg, bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50)

We can also print some edge labels, in this case we'll display the edge weights as the labels.

In [None]:
sg.es['width'] = 1
sg.es['label'] = sg.es['weight']

ig.plot(sg, bbox=(300,300), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50, edge_label_size=7)

### Plotting in igraph

* igraph uses **cairo** for plotting, along with a python interface such as **pycairo** or **cairocffi**.
* latest versions of igraph can also use **matplotlib** (see example below)
* graph can be exported as **networkx** format which can be used in **bokeh** for interactive plotting
* another option is to save the graph in the DOT format used by **GraphViz**


Here's an example of how to multiplot with matplotlib

In [None]:
fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2, figsize=(8,8))
sg = g.subgraph(subgraph_nodes)
ig.plot(sg.as_undirected(combine_edges=sum), bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50, target=ax1)
sg = sg.simplify(combine_edges=sum)
ig.plot(sg.as_undirected(combine_edges=sum),bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=8, margin=50, target=ax2)
ig.plot(sg, bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=6, margin=50, target=ax3)
sg.es['width'] = [int(np.log10(x)+1) for x in sg.es['weight']]
ig.plot(sg, bbox=(250,250), vertex_label=sg.vs['name'], 
        vertex_label_size=6, margin=50, target=ax4);

## Other Representations of Graphs

Here's how to obtain other representations of a graph from our igraph object:
* graph as a list of edges (dataframe)
* graph as an adjacency matrix
* graph as a weighted adjacency matrix

In [None]:
## here, DNW stands for directed, named, weighted
sg.summary()

#### List of edges
We'll export to pandas using `.get_edge_dataframe()` to do this

In [None]:
df = sg.get_edge_dataframe()

In [None]:
df

Now let's replace node ids with names

In [None]:
df_vert = sg.get_vertex_dataframe()
df['source'] = df['source'].map(df_vert['name'])  ## vertex id -> name
df['target'] = df['target'].map(df_vert['name'])
df.sort_values(by='weight', ascending=False, inplace=True)
df

We could also have built the dataframe column by column instead:

In [None]:
df = pd.DataFrame()
df['from'] = [sg.vs[e.tuple[0]]['name'] for e in sg.es]
df['to'] = [sg.vs[e.tuple[1]]['name'] for e in sg.es]
df['weight'] = [e['weight'] for e in sg.es]
df.sort_values(by='weight', ascending=False, inplace=True)
df

#### Adjacency matrix (binary)

In [None]:
print(sg.get_adjacency())

#### Weighted adjacency matrix

In [None]:
print(sg.get_adjacency(attribute='weight'))
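
For reference, both matrices can be built by hand from an edge list. This plain-Python sketch uses hypothetical weights; row `i`, column `j` holds the edge from node `i` to node `j`.

```python
names = ["LAX", "SFO", "OAK"]
# Hypothetical directed edges: (source, target, passengers)
edges = [("LAX", "SFO", 100), ("SFO", "LAX", 90), ("SFO", "OAK", 10)]

index = {name: i for i, name in enumerate(names)}
n = len(names)
binary = [[0] * n for _ in range(n)]
weighted = [[0] * n for _ in range(n)]

for s, t, w in edges:
    binary[index[s]][index[t]] = 1    # 1 if an edge s -> t exists
    weighted[index[s]][index[t]] = w  # the edge weight instead of 1

print(binary)    # [[0, 1, 0], [1, 0, 1], [0, 0, 0]]
print(weighted)  # [[0, 100, 0], [90, 0, 10], [0, 0, 0]]
```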

### Questions
Try to answer the following questions using the graph objects that we've created.

#### 1. How many airports are in California (CA)?


#### 2. Which 5 states have the most airports? the least?
Hint: we've already imported `Counter` from `collections`

In [None]:
# most


In [None]:
# least


#### 3. Which airport is the southernmost? northernmost?

In [None]:
# southernmost


In [None]:
# northernmost 


#### 4. How many routes have at least 1 million passengers?



#### 5. Which route is the busiest one-way? both ways?


In [None]:
# one-way


In [None]:
# both ways


### Possible Solutions

In [None]:
## airports in CA, two ways
print('airports in CA:', 
      len([v for v in g.vs if v['state'] == 'CA']), 
      sum(airport_df['state']=='CA'))

## states with most airports
print('\nmost airports:',Counter(g.vs['state']).most_common(5))

## states with the least airports
print('\nleast airports:',Counter(g.vs['state']).most_common()[-5:])

## north/south
latitude = [-x[1] for x in g.vs['layout']]
v = np.argmin(latitude)
print('\nsouthernmost:',g.vs[v],'\n')
v = np.argmax(latitude)
print('northernmost:',g.vs[v])

## 1M+ connections
print('\n1M+ connections:',len([e for e in g.es if e['weight'] >= 1000000]),'\n')

## busiest route (1-way) 
e = np.argmax(g.es['weight'])
print_edge_details(g, e)

## busiest route (2-way) 
g_und = g.as_undirected(combine_edges=sum)
e = np.argmax(g_und.es['weight'])
print('\n2-way:', g_und.vs[g_und.es[e].tuple[0]]['name'], '---',  g_und.es[e]['weight'], '---', 
      g_und.vs[g_und.es[e].tuple[1]]['name'],)


## 1.2 Exploratory Data Analysis (EDA)

### Discussion

What makes network/graph data challenging?

* The points are connected and cannot be considered as independent samples
* Inference on graphs requires the topological structure: nodes, edges, neighbourhoods, etc.
* The topological roles of nodes can be highly variable (degree, betweenness, centrality, etc.)


### Visualization

Let's take a look at the entire graph. 

First, using a **force directed layout**, we see a small **disconnected component**. This is not uncommon.

In social networks and many other types of graphs, a common situation is to have a **giant component** with most nodes, and some small components.

In [None]:
ig.plot(g, vertex_size=5, edge_arrow_size=.5, edge_color='grey', 
        layout=g.layout_fruchterman_reingold(), bbox=(500,400))

Next, let's use the latitude and longitude based layout, which reveals a familiar shape.

In [None]:
ig.plot(g, vertex_size=5, edge_arrow_size=.5, edge_color='grey', 
        layout=g.vs['layout'], bbox=(500,400))

This is a LOT to look at. We can look at the subgraph induced by a single state. 

In [None]:
st = 'MN'
sg = g.subgraph([v for v in g.vs if v['state'] == st])
ig.plot(sg,bbox=(450,350), vertex_label=sg.vs['name'], vertex_size=15,layout=sg.vs['layout'],
            vertex_label_size=6, margin=50)

### Node degree(s)

A key node feature is **degree**.

With a **directed graph**, we can distinguish 4 concepts of degree (!)

* in-degree: number of edges coming in
* out-degree: number of edges going out
* total degree: sum of the above 2 quantities
* (undirected) degree: degree when reducing to **undirected** graph
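
The four notions differ as soon as a pair of nodes is connected in both directions. A plain-Python sketch on a hypothetical toy graph (no loops), where the undirected degree counts distinct neighbours:

```python
# Hypothetical directed edges containing a reciprocal pair A <-> B
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("D", "A")]
nodes = sorted({v for e in edges for v in e})

in_deg = {v: sum(t == v for _, t in edges) for v in nodes}
out_deg = {v: sum(s == v for s, _ in edges) for v in nodes}
total_deg = {v: in_deg[v] + out_deg[v] for v in nodes}

# Undirected degree: reciprocal edges collapse into a single neighbour
neighbours = {v: set() for v in nodes}
for s, t in edges:
    neighbours[s].add(t)
    neighbours[t].add(s)
und_deg = {v: len(neighbours[v]) for v in nodes}

print(in_deg["A"], out_deg["A"], total_deg["A"], und_deg["A"])  # 2 2 4 3
```

Node `A` has total degree 4 but undirected degree 3, because the two edges between `A` and `B` become one.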

Let's look at all four degrees in our airport graph.


In [None]:
df = pd.DataFrame()
df['node'] = g.vs['name']
df['in-deg'] = g.degree(mode='in')
df['out-deg'] = g.degree(mode='out')
df['total-deg'] = g.degree(mode='all')
df['und-deg'] = g.as_undirected().degree()

In [None]:
df.loc[df['node'].isin(['SFO','LAX','OAK','OPF'])]

### Ego-net of a node

The **ego-net** of a node is the subgraph induced by a node and its neighbours. Let's take a look at the ego-net of OPF, the Miami executive airport.



In [None]:
sg = g.induced_subgraph(g.neighborhood(g.vs.find(name='OPF')))

In [None]:
## with multiple plots, we'll set some parameters for re-use
visual_style = {}
visual_style["vertex_label_size"] = 8
visual_style["bbox"] = (300, 300)
visual_style["margin"] = 50


In [None]:
ig.plot(sg, **visual_style, vertex_label=sg.vs['name'])

Here's the undirected version

In [None]:
sg = sg.as_undirected()
ig.plot(sg, **visual_style, vertex_label=sg.vs['name'])

### Weighted degree a.k.a. Strength

With **weighted** graphs, we also define:

* in-strength: sum of weights of all incoming edges
* out-strength: sum of weights of all outgoing edges
* total-strength: sum of the above two quantities

When converting a directed graph to undirected, it is common to add the edge weights, so we get the "total-strength". We check this below by computing the undirected strength (`und-str`) by combining edges by sum when converting from directed to undirected. 
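
The same identity can be verified by hand on a toy example. This plain-Python sketch uses hypothetical weights and no loops:

```python
# Hypothetical weighted directed edges: (source, target, weight)
edges = [("A", "B", 5), ("B", "A", 3), ("C", "A", 2)]
nodes = sorted({v for s, t, _ in edges for v in (s, t)})

in_str = {v: sum(w for _, t, w in edges if t == v) for v in nodes}
out_str = {v: sum(w for s, _, w in edges if s == v) for v in nodes}
total_str = {v: in_str[v] + out_str[v] for v in nodes}

# Undirected version: weights of both directions are summed per pair
und = {}
for s, t, w in edges:
    key = frozenset((s, t))
    und[key] = und.get(key, 0) + w
und_str = {v: sum(w for key, w in und.items() if v in key) for v in nodes}

print(total_str)             # {'A': 10, 'B': 8, 'C': 2}
print(und_str == total_str)  # True
```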


In [None]:
df['in-str'] = g.strength(mode='in', weights='weight')
df['out-str'] = g.strength(mode='out', weights='weight')
df['total-str'] = g.strength(mode='all', weights='weight')
df['und-str'] = g.as_undirected(combine_edges=sum).strength(weights='weight')
df.loc[df['node'].isin(['SFO','LAX','OAK','OPF'])]

In [None]:
all(df['total-str'] == df['und-str'])

### Degree distribution

As in most social-type networks, the degree distribution of the airport graph is far from uniform: there are many low-degree nodes and a small number of high-degree ones.

This is suggestive of a **power-law** degree distribution.

In such networks, shortest paths between connected nodes are typically very short (the **six degrees of separation** phenomenon).

In [None]:
plt.figure(figsize=(8,3))
plt.subplot(121)
plt.hist(g.degree(mode='in'), bins=30)
plt.title('in-degree')
plt.xlabel('degree')
plt.ylabel('frequency')
plt.subplot(122)
plt.hist(g.degree(mode='out'), bins=30)
plt.title('out-degree')
plt.xlabel('degree');

### Paths and connected components

Here are some basic definitions related to how connected nodes are with each other:

* A **path** is a sequence of edges connecting two nodes
* A **connected component** is a maximal subset of nodes such that there is a path between every pair of nodes in the subset 
* **Path length** is usually the number of edges (**hop count**), but weights can also be considered to define path length
* For directed graphs, we can define two notions of **connectivity**, depending on whether we take directionality into account (**strong** connectivity) or not (**weak** connectivity); those are the same in undirected graphs
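
The difference between weak and strong connectivity can be sketched in plain Python. Two nodes are in the same strong component when each can reach the other; the helper names and the toy graph below are hypothetical.

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following edges in adj."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def components(nodes, same_component):
    """Group nodes using a function returning the component of a node."""
    comps, assigned = [], set()
    for v in nodes:
        if v not in assigned:
            comp = same_component(v)
            comps.append(sorted(comp))
            assigned |= comp
    return comps

edges = [("A", "B"), ("B", "A"), ("B", "C")]  # hypothetical toy graph
nodes = sorted({v for e in edges for v in e})
fwd, rev, both = {}, {}, {}
for s, t in edges:
    fwd.setdefault(s, set()).add(t)   # follow edge direction
    rev.setdefault(t, set()).add(s)   # follow edges backwards
    both.setdefault(s, set()).add(t)  # ignore direction
    both.setdefault(t, set()).add(s)

weak = components(nodes, lambda v: reachable(both, v))
strong = components(nodes, lambda v: reachable(fwd, v) & reachable(rev, v))
print(weak)    # [['A', 'B', 'C']]
print(strong)  # [['A', 'B'], ['C']]
```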


In [None]:
print('strong connectivity:', g.connected_components(mode='strong').summary())

In [None]:
print('weak connectivity:',g.connected_components(mode='weak').summary())

In [None]:
g_und = g.as_undirected(combine_edges=sum) ## undirected graph, summing the weights
print('undirected connectivity:',g_und.connected_components().summary())

It can be interesting to consider where there are paths between nodes, and where there are not. Let's look at where we can get to starting from a given airport. Recall that there are 464 airports (number of nodes `.vcount()`).

Let's look at OPF and SFO and explore the shortest (directed) paths.

In [None]:
for ap in ["OPF","SFO"]:
    print("\nlooking at:", ap)
    v = g.vs.find(name=ap)
    sp = g.distances(source=v, mode='out')[0]
    print('number of "unreachable" airports:',sum([i == np.inf for i in sp]))
    print('mean number of hops to other airports:',np.mean([i for i in sp if i != np.inf ]))
    print('max number of hops to other airports:',np.max([i for i in sp if i != np.inf ]))
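
Hop counts of this kind come from a breadth-first search along edge directions, with unreachable nodes left at infinity. A minimal plain-Python sketch on a hypothetical toy graph:

```python
from collections import deque
import math

def hop_counts(adj, nodes, source):
    """Breadth-first search: number of hops from source to every node."""
    dist = {v: math.inf for v in nodes}
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if dist[v] == math.inf:  # first visit gives the shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical directed edges; D has only an outgoing edge
edges = [("A", "B"), ("B", "C"), ("D", "A")]
nodes = ["A", "B", "C", "D"]
adj = {}
for s, t in edges:
    adj.setdefault(s, []).append(t)

dist = hop_counts(adj, nodes, "A")
finite = [d for d in dist.values() if d != math.inf]
print(dist)                                         # {'A': 0, 'B': 1, 'C': 2, 'D': inf}
print(sum(d == math.inf for d in dist.values()))    # 1
print(max(finite))                                  # 2
```

Node `D` is unreachable from `A` because no edge points into it, even though the two nodes are weakly connected.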

### Why are some airports "unreachable" from SFO?

Two airports are in a separate (weak) connected component. What about the other 18? Let's see if we can identify why all of these airports are unreachable. To make these stand out, let's colour unreachable nodes red.

In [None]:
## indices of vertices at infinite distance from SFO (sp comes from the last loop iteration above)
unreachable = [j for j, d in enumerate(sp) if d == np.inf]
g.vs[unreachable]['color'] = 'red'
unreachable

In [None]:
ig.plot(g, vertex_size=5, edge_arrow_size=.5, edge_color='grey', 
        layout=g.layout_fruchterman_reingold(), bbox=(500,400))

We see 2 airports in a small connected component 

In [None]:
g.vs['cc'] = g.connected_components('weak').membership
Counter(g.vs['cc'])

Let's remove those 2 nodes. We make a copy of the graph to keep the original graph intact. 

In [None]:
v_list = [v['name'] for v in g.vs if v['cc']==1]
g_copy = g.copy()
g_copy.delete_vertices(v_list)

Now let's also remove airports that have no incoming links.

In [None]:
v_list = [v for v in g_copy.vs if g_copy.degree(v,'in') == 0]
print('removing',len(v_list),'more')
g_copy.delete_vertices(v_list)

There's still one such node left.

Let's repeat the removal of airport(s) without incoming links.

In [None]:
v_list = [v for v in g_copy.vs if g_copy.degree(v,'in') == 0]
print('removing', len(v_list),'more')
g_copy.delete_vertices(v_list)

That seems to have got them all. Once we remove the other component and repeatedly remove airports with no incoming links, all the airports become reachable from SFO. 

In [None]:
for ap in ["SFO"]:
    print("\nlooking at:", ap)
    v = g_copy.vs.find(name=ap)
    sp = g_copy.distances(source=v, mode='out')[0]
    print('number of "unreachable" airports:',sum([i == np.inf for i in sp]))
    print('mean number of hops to other airports:',np.mean([i for i in sp if i != np.inf ]))
    print('max number of hops to other airports:',np.max([i for i in sp if i != np.inf ]))
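
The two manual removal rounds above can be written as a loop that stops when no node without incoming links remains. This is a plain-Python sketch on sets of nodes and edges; the function name and the toy graph are hypothetical.

```python
def prune_no_incoming(nodes, edges):
    """Repeatedly drop nodes with no incoming edge; return survivors and rounds."""
    nodes, edges = set(nodes), set(edges)
    rounds = 0
    while True:
        has_incoming = {t for _, t in edges}
        sources_only = nodes - has_incoming
        if not sources_only:
            break
        nodes -= sources_only
        # keep only edges whose endpoints both survive
        edges = {(s, t) for s, t in edges if s in nodes and t in nodes}
        rounds += 1
    return nodes, rounds

# Hypothetical chain: X feeds Y, Y feeds the cycle A <-> B
nodes = {"X", "Y", "A", "B"}
edges = {("X", "Y"), ("Y", "A"), ("A", "B"), ("B", "A")}
kept, rounds = prune_no_incoming(nodes, edges)
print(sorted(kept), rounds)  # ['A', 'B'] 2
```

Removing `X` leaves `Y` without incoming links, which is why a second round is needed, just as in the airport graph.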

### Questions

#### 1. Which airport has the largest number of outgoing connections? incoming? total?

In [None]:
# outgoing


In [None]:
# incoming


In [None]:
# total


#### 2. Which airport has the largest number of passengers in total? the smallest?



In [None]:
# largest


In [None]:
# smallest


#### 3. What happens if we ignore direction of flights and consider: 
* the number of "unreachable" airports from SFO, OPF
* the mean number of hops to other airports from SFO, OPF
* max number of hops to other airports from SFO, OPF

### Possible Solutions

In [None]:
## outgoing connections
x = np.argwhere(g.degree(mode='out') == np.max(g.degree(mode='out'))).flatten()
for v in x:
    print(g.vs[v]['name'],"has outgoing connections to",g.degree(v,'out'),"airports")

## incoming connections
x = np.argwhere(g.degree(mode='in') == np.max(g.degree(mode='in'))).flatten()
for v in x:
    print(g.vs[v]['name'],"has incoming connections from",g.degree(v,'in'),"airports")

## any connections
x = np.argwhere(g_und.degree() == np.max(g_und.degree())).flatten()
for v in x:
    print(g_und.vs[v]['name'],"has connections with",g_und.degree(v),"airports\n")

## largest number of passengers - total
v = np.argmax(g_und.strength(weights='weight'))
print(g_und.vs[v]['name'],"has",int(g.strength(v, weights='weight')),"total passengers")

## smallest number of passengers - total
v = np.argmin(g_und.strength(weights='weight'))
print(g_und.vs[v]['name'],"has",int(g.strength(v, weights='weight')),"total passengers")

## undirected short paths
for ap in ["OPF","SFO"]:
    print("\nlooking at:",ap)
    v = g.vs.find(name=ap)
    sp = g.distances(source=v, mode='all')
    print('number of unreachable airports:',sum([i == np.inf for i in sp[0]]))
    print('mean number of hops to other airports:',np.mean([i for i in sp[0] if i != np.inf ]))
    print('max number of hops to other airports:',np.max([i for i in sp[0] if i != np.inf ]))
    

## 1.3 Node importance

We explore 3 concepts of node (vertex) importance in a graph:

* **coreness**: useful to prune nodes with low connectivity in a graph
* **centrality**: identify most influential nodes in various ways
* **betweenness**: identify nodes on geodesics (shortest paths) between several other nodes


### Coreness (k-cores)

The **k-core** of a graph is the maximal subgraph where all nodes have degree at least k. 

The **coreness** of a node is k if it belongs to the k-core, but not the (k+1)-core. 

This is usually done with the **undirected** degrees, but one can also look for 'in' and 'out' k-cores.

Below we compute coreness for the undirected version of the airport graph. We see:

* many nodes in a large 50-core subgraph (which we plot)
* many nodes with small coreness (1 or 2)

Pruning nodes with small coreness often drastically reduces the size of the graph.
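
Coreness can be computed by repeatedly peeling off the node of lowest remaining degree. The sketch below is a simple, unoptimised plain-Python version of that idea for undirected graphs without loops, not the algorithm igraph uses internally; the toy graph is hypothetical.

```python
def coreness(adj):
    """Peel nodes in order of increasing remaining degree."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)  # lowest remaining degree
        k = max(k, degree[v])               # coreness never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

# Hypothetical graph: triangle A-B-C with a pendant node D attached to A
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
print(sorted(coreness(adj).items()))  # [('A', 2), ('B', 2), ('C', 2), ('D', 1)]
```

The triangle forms the 2-core, while the pendant node `D` only belongs to the 1-core.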


In [None]:
g_und = g_und.simplify(combine_edges=sum)
## most frequent coreness values
g_und.vs['core'] = g_und.coreness()
Counter(g_und.vs['core']).most_common(3)

In [None]:
plt.hist(g_und.vs['core'], bins=25);

Let's look at the largest k-core - plot with lat/lon layout.

We see that all of the nodes in the largest k-core are major airports.

In [None]:
sg = g_und.subgraph([v for v in g_und.vs if v['core'] == 50])
ig.plot(sg,bbox=(600,450), vertex_label=sg.vs['name'], vertex_size=15,layout=sg.vs['layout'],
            vertex_label_size=6, margin=50)

### Centrality, betweenness

There are various ways to define **centrality** of nodes in a network, such as:
* its **degree** or **strength** (weighted degree)
* **pagerank** (proportional to number of visits in random walks)

**Betweenness** measures the proportion of shortest paths (geodesics) going through each node.
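
To make the random-walk interpretation of pagerank concrete, here is a minimal power-iteration sketch in plain Python. It is unweighted, uses a damping factor of 0.85, and spreads the rank of nodes without outgoing edges uniformly; these are choices made for this illustration, and the toy graph is hypothetical.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Unweighted pagerank by power iteration."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for s, t in edges:
        out[s].append(t)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # rank held by nodes without outgoing edges is spread uniformly
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for s in nodes:
            for t in out[s]:
                new[t] += damping * rank[s] / len(out[s])
        rank = new
    return rank

# Hypothetical graph: A and B point at each other, C points at A
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "A"), ("C", "A")]
pr = pagerank(nodes, edges)
print(sorted(pr, key=pr.get, reverse=True))  # ['A', 'B', 'C']
```

Node `C` has no incoming edge, so a walker only lands there through random restarts; its score stays at the baseline `(1 - damping) / n`.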

Let's explore this with the California subgraph.

First induce the subgraph based on vertices in California.

In [None]:
sg = g.subgraph([v for v in g.vs if v['state']=='CA'])

Then keep only nodes with some connection within the state

In [None]:
sg = sg.subgraph([v for v in sg.vs() if v.degree()>0])

Drop loops

In [None]:
sg = sg.simplify(multiple=False)

And take a look:

In [None]:
print(sg.vcount(),'nodes and',sg.ecount(),'directed edges')
ig.plot(sg,bbox=(400,400), vertex_label=sg.vs['name'], vertex_size=15,layout=sg.vs['layout'],
            vertex_label_size=6, margin=50, edge_arrow_size=.33, edge_color='grey')

Compute a few things and sort with respect to pagerank scores

In [None]:
df = pd.DataFrame()
df['degree'] = sg.degree()
df['pagerank'] = sg.pagerank(weights='weight')
n = sg.vcount()
df['between'] = [2 * x / ((n - 1) * (n - 2)) for x in sg.betweenness()] ## normalized
df['state'] = sg.vs['state']
df['city'] = sg.vs['city']
df['name'] = sg.vs['name']
df.sort_values(by='pagerank', inplace=True, ascending=False)
df.head()

We see that SAN has much lower betweenness than the two big hub airports.

High degree nodes are typically more likely to have high centrality/betweenness.

In [None]:
plt.figure(figsize=(9,4))
plt.subplot(121)
plt.scatter(df['degree'],df['pagerank'])
plt.xlabel('degree')
plt.ylabel('pagerank')
plt.subplot(122)
plt.scatter(df['degree'],df['between'])
plt.xlabel('degree')
plt.ylabel('betweenness');

### Questions

There are nodes in the California subgraph with zero betweenness, but pagerank score above 0.04.

#### 1. Which nodes are they?

#### 2. To explore why this might be, plot the ego net of these nodes and their neighbours.

#### 3. What might explain the zero betweenness and pagerank above 0.04?

Recall: pagerank can be interpreted as visits from multiple random walks ...


### Possible Solutions

In [None]:
print(df[(df['between']==0) & (df['pagerank']>.04)])

## get all vertices in the neighbourhoods
v = set(sg.neighborhood(sg.vs.find('MCE'))).union(set(sg.neighborhood(sg.vs.find('VIS'))))

## plot induced subgraph
sg_ego = sg.subgraph(v)
ig.plot(sg_ego,bbox=(300,200), vertex_label=sg_ego.vs['name'], vertex_size=15,layout=sg_ego.vs['layout'],
            vertex_label_size=6, margin=50)

## those two nodes are disconnected from the rest!
## thus, betweenness must be zero
## pagerank is above zero as a walk starting at one of those nodes is trapped


## Bonus Material
See the "extra" notebook for Part 1