# Extra problem - a mystery graph

Let's use what we've learned about EDA in Part 1 to figure out what kind of data we have from the data itself.

The airport graph, like most social-type graphs, shares some common structure:
* **power-law** degree distribution
* **small world** behaviour (a.k.a. six degrees of separation)
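
What a heavy-tailed degree distribution looks like can be sketched on toy data. The block below builds a hypothetical preferential-attachment graph (new nodes link to existing nodes with probability proportional to their degree, a common toy model for hub-dominated graphs) and a square grid of similar size, then compares maximum and mean degree. Both generators, their names, and their parameters are illustrative assumptions and have nothing to do with the dataset files.

```python
import random
from collections import Counter

def preferential_attachment_edges(n, m=2, seed=0):
    """Toy generator: each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # start from a small complete graph on m + 1 nodes
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # every node appears in `ends` once per incident edge
    ends = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges

def grid_edges(side):
    """Edges of a side x side lattice; node id = row * side + col."""
    edges = []
    for r in range(side):
        for c in range(side):
            v = r * side + c
            if c + 1 < side:
                edges.append((v, v + 1))
            if r + 1 < side:
                edges.append((v, v + side))
    return edges

def degree_summary(edges):
    """Return (max degree, mean degree) over nodes that have at least one edge."""
    deg = Counter(v for e in edges for v in e)
    values = list(deg.values())
    return max(values), sum(values) / len(values)

pa_max, pa_mean = degree_summary(preferential_attachment_edges(2000))
grid_max, grid_mean = degree_summary(grid_edges(45))
print('preferential attachment - max:', pa_max, 'mean:', pa_mean)
print('grid                    - max:', grid_max, 'mean:', grid_mean)
```

Both graphs have a mean degree close to 4, but only the hub-dominated one has a maximum degree many times larger than its mean; in the grid no node exceeds degree 4.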
    
In the `Datasets/Example` directory, there are two files:
* edges: one undirected, unweighted edge per line, given as a pair of node names (no header, `.ncol` format)
* nodes: csv file with node attributes (with header)

Beware: node names are integers, and igraph stores them as strings when it reads the edge list.


In [None]:
#### path to the datasets
datadir='../Datasets/'

## required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import igraph as ig
import partition_igraph
from collections import Counter
from sklearn.metrics import adjusted_rand_score as ARI
from sklearn.metrics import adjusted_mutual_info_score as AMI


#### 1. Build an undirected, unweighted graph using the 'edges' file and remove loops; how many nodes/edges are there?

**Hint:** there are various ways to create an igraph by reading from a file via methods of the form `ig.Graph.Read*`. You'll want to find the method that works with `.ncol` format.
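
To make the `.ncol` layout and the loop removal concrete, the sketch below parses a few edge lines in plain Python, without igraph, and drops self-loops and repeated edges. The sample lines and the helper name `simple_edges` are made up for illustration; they are not taken from the dataset.

```python
def simple_edges(lines):
    """Parse 'u v' lines into a set of undirected edges without loops or duplicates."""
    edges = set()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        u, v = parts[0], parts[1]         # names stay strings, even if they look like integers
        if u == v:
            continue                      # self-loop
        edges.add(frozenset((u, v)))      # frozenset ignores direction: "1 2" equals "2 1"
    return edges

# hypothetical sample: one loop ("3 3") and one repeated edge ("2 1")
sample = ["1 2", "2 3", "3 3", "2 1", "3 4"]
edges = simple_edges(sample)
nodes = set().union(*edges)
print(len(nodes), 'nodes and', len(edges), 'edges')
```

Counting nodes and edges before and after this kind of clean-up shows how many loops and repeated edges the raw file contained.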

#### 2. Plot a histogram of the degree distribution

#### 3. Find the mean and maximum shortest-path lengths from a few nodes to all other nodes. Based on these results and the previous step, what do you think this graph could be?

#### 4. Load the node attributes, which include lat/lon, and use them to plot the graph. What do you think now?

**Note:** The nodes file contains more nodes than appear in the edges file.

### Possible Solutions

In [None]:
## raw read of the edge list, before removing loops and multi-edges
gr = ig.Graph.Read_Ncol(datadir+'Example/edges', directed=False)
print(gr.vcount(), 'nodes and',gr.ecount(),'edges')


In [None]:
## 1. read edge list and build undirected simple graph
gr = ig.Graph.Read_Ncol(datadir+'Example/edges', directed=False)
gr = gr.simplify()
print(gr.vcount(), 'nodes and',gr.ecount(),'edges')
## vertex names are integers stored as strings
print(gr.vs['name'][:10])

In [None]:
## 2. degree distribution - we see mostly small values ...
print('max:',np.max(gr.degree()))
plt.hist(gr.degree(), bins=16);

In [None]:
## 3. shortest paths from a few nodes -- much larger values than in the airport graph ...
print('number of nodes:',gr.vcount())
for v in [0,1000]:
    print("\nlooking at node:",v)
    sp = gr.distances(source=v)[0]
    ## unreachable nodes are reported as inf
    reachable = [d for d in sp if d != np.inf]
    print('number of unreachable nodes:',len(sp)-len(reachable))
    print('mean number of hops to other nodes:',np.mean(reachable))
    print('max number of hops to other nodes:',np.max(reachable))
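
The contrast with a small-world graph can be reproduced on toy data. The sketch below is a plain-Python breadth-first search (a stand-in for `gr.distances`, not igraph's implementation) run on a hypothetical 30 x 30 grid, and then on the same grid with one extra hub node linked to every vertex. The helper names and the toy graphs are assumptions made for illustration.

```python
from collections import deque

def grid_adjacency(side):
    """Adjacency lists of a side x side lattice; node id = row * side + col."""
    adj = {v: [] for v in range(side * side)}
    for r in range(side):
        for c in range(side):
            v = r * side + c
            if c + 1 < side:
                adj[v].append(v + 1)
                adj[v + 1].append(v)
            if r + 1 < side:
                adj[v].append(v + side)
                adj[v + side].append(v)
    return adj

def bfs_hops(adj, source):
    """Hop counts from source to every reachable node (unreachable nodes are absent)."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in hops:
                hops[w] = hops[u] + 1
                queue.append(w)
    return hops

side = 30
grid = grid_adjacency(side)
grid_hops = bfs_hops(grid, 0)          # start from a corner of the grid

# same grid plus one hypothetical hub connected to every node
hub = side * side
hubbed = {v: nbrs + [hub] for v, nbrs in grid.items()}
hubbed[hub] = list(grid)
hub_hops = bfs_hops(hubbed, 0)

for label, hops in [('grid', grid_hops), ('grid + hub', hub_hops)]:
    values = list(hops.values())
    print(label, '- mean hops:', sum(values) / len(values), 'max hops:', max(values))
```

In a lattice-like graph the hop counts grow with the side length, while a single hub collapses them to at most two. Long mean and maximum paths, as in the cell above, therefore hint at a spatially constrained graph rather than a hub-dominated one.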


In [None]:
## 4. read node attributes
D = pd.read_csv(datadir+'Example/nodes')
print(D.shape) 
## nb: there are more nodes here than in the graph (13844) ... 
D.head()

In [None]:
## attributes are keyed by vertex name -- map each vertex to its row index in D
lookup = {str(k):v for v,k in enumerate(D['name'])}
l = [lookup[x] for x in gr.vs['name']]

## store layout attributes in graph and plot
## nb: the plot's origin is at the top-left (y grows downward), so we negate latitude to keep north at the top
gr.vs['layout'] = [(D['lon'][i],-D['lat'][i]) for i in l]
ig.plot(gr, bbox=(500,400), layout = gr.vs['layout'], vertex_size=3, vertex_color='lightblue', margin=50)
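
The name-to-row mapping used above is easy to get wrong when the attribute table has extra rows, so here is a small stand-alone sketch of the same idea using only the standard library. The CSV content and the vertex names are invented; only the column names `name`, `lat`, and `lon` come from the cells above.

```python
import csv
import io

# hypothetical attribute table: node 40 never appears in the edge list
nodes_csv = io.StringIO(
    "name,lat,lon\n"
    "10,48.1,11.6\n"
    "20,52.5,13.4\n"
    "30,41.9,12.5\n"
    "40,59.3,18.1\n"
)
rows = list(csv.DictReader(nodes_csv))

# vertex names as a graph library would hand them back: strings, in graph order
vertex_names = ["30", "10", "20"]

# map each name to its row index, then pick the rows in vertex order
lookup = {row["name"]: i for i, row in enumerate(rows)}
order = [lookup[name] for name in vertex_names]

# negate latitude so that north ends up at the top when y grows downward
layout = [(float(rows[i]["lon"]), -float(rows[i]["lat"])) for i in order]

# rows that have attributes but no vertex in the graph
in_graph = set(vertex_names)
unused = [row["name"] for row in rows if row["name"] not in in_graph]

print(layout)
print('rows without a vertex:', unused)
```

Building the layout through the lookup, rather than by row position, keeps each coordinate attached to the right vertex even though the table and the graph list nodes in different orders.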


### Europe Electric Grid

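The mystery graph is the European high-voltage electric grid: vertices are stations and edges are the lines connecting them. As a spatially constrained network, it has the small degrees and long shortest paths seen above.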
More details at: https://zenodo.org/record/47317#.Xt6nzy3MxTY
